Preface


This book is about the growing intersection of data-driven methods, machine 
learning, applied optimization, and the classical fields of engineering mathe- 
matics and mathematical physics. We developed this material over a number of 
years, primarily to educate our advanced undergraduate and beginning grad- 
uate students from engineering and physical science departments. Typically, 
such students have backgrounds in linear algebra, differential equations, and 
scientific computing, with engineers often having some exposure to control the- 
ory and/or partial differential equations. However, most undergraduate curric- 
ula in engineering and science fields have little or no exposure to data methods 
and/or optimization. Likewise, computer scientists and statisticians have little 
exposure to dynamical systems and control. Our goal is to provide a broad en- 
try point to applied machine learning for both of these groups of students. We 
have chosen the methods discussed in this book for their (1) relevance, (2) sim- 
plicity, and (3) generality, and we have attempted to present a range of topics, 
from basic introductory material up to research-level techniques. 

Data-driven discovery is currently revolutionizing how we model, predict, 
and control complex systems. The most pressing scientific and engineering 
problems of the modern era are not amenable to empirical models or deriva- 
tions based on first principles. Increasingly, researchers are turning to data- 
driven approaches for a diverse range of complex systems, such as turbulence, 
the brain, climate, epidemiology, finance, robotics, and autonomy. These sys- 
tems are typically nonlinear, dynamic, multi-scale in space and time, and high- 
dimensional, with dominant underlying patterns that should be characterized 
and modeled for the eventual goal of sensing, prediction, estimation, and con- 
trol. With modern mathematical methods, enabled by the unprecedented avail- 
ability of data and computational resources, we are now able to tackle previ- 
ously unattainable problems. A small handful of these new techniques include 
robust image reconstruction from sparse and noisy random pixel measure- 
ments, turbulence control with machine learning, optimal sensor and actuator 
placement, discovering interpretable nonlinear dynamical systems purely from 
data, and reduced-order models to accelerate the optimization and control of 
systems with complex multi-scale physics. 

Driving modern data science is the availability of vast and increasing
quantities of data, enabled by remarkable innovations in low-cost sensors,
orders-of-magnitude increases in computational power, and virtually unlimited
data storage and transfer capabilities. Such vast quantities of data are
affording engineers and scientists across all disciplines new opportunities for
data-driven discovery, which has been referred to as the fourth paradigm of
scientific discovery [826]. This fourth paradigm is the natural culmination of
the first three paradigms: empirical experimentation, analytical derivation,
and computational investigation. The integration of these techniques provides a
transformative framework for data-driven discovery efforts. This process of
scientific discovery is not new, and indeed mimics the efforts of leading
figures of the scientific revolution: Johannes Kepler (1571-1630) and Sir Isaac
Newton (1642-1727). Each played a critical role in developing the theoretical
underpinnings of celestial mechanics, based on a combination of empirical
data-driven and analytical approaches. Data science is not replacing
mathematical physics and engineering, but is instead augmenting it for the
twenty-first century, resulting in more of a renaissance than a revolution.

Data science itself is not new, having been proposed more than 50 years ago 
by John Tukey, who envisioned the existence of a scientific effort focused on 
learning from data, or data analysis [205]. Since that time, data science has been 
largely dominated by two distinct cultural outlooks on data [109]. The machine 
learning community, which predominantly comprises computer scientists, is 
typically centered on prediction quality and scalable, fast algorithms. Although 
not necessarily in contrast, the statistical learning community, often centered in 
statistics departments, focuses on the inference of interpretable models. Both 
methodologies have achieved significant success and have provided the math- 
ematical and computational foundations for data science methods. For engi- 
neers and scientists, the goal is to leverage these broad techniques to infer and 
compute models (typically nonlinear) from observations that correctly iden- 
tify the underlying dynamics and generalize qualitatively and quantitatively to 
unmeasured parts of phase, parameter, or application space. Our goal in this 
book is to leverage the power of both statistical and machine learning to solve 
engineering problems. 


Themes of This Book 


There are a number of key themes that have emerged throughout this book.
First, many complex systems exhibit dominant low-dimensional patterns in the
data, despite the rapidly increasing resolution of measurements and
computations. This underlying structure enables efficient sensing, and compact
representations for modeling and control. Pattern extraction is related to the
second theme of finding coordinate transforms that simplify the system. Indeed,
the rich history of mathematical physics is centered around coordinate
transformations (e.g., spectral decompositions, the Fourier transform,
generalized functions, etc.), although these techniques have largely been
limited to simple idealized geometries and linear dynamics. The ability to
derive data-driven transformations opens up opportunities to generalize these
techniques to new research problems with more complex geometries and boundary
conditions. We also take the perspective of dynamical systems and control
throughout the book, applying data-driven techniques to model and control
systems that evolve in time. Perhaps the most pervasive theme is that of
data-driven applied optimization, as nearly every topic discussed is related to
optimization (e.g., finding optimal low-dimensional patterns, optimal sensor
placement, machine learning optimization, optimal control, etc.). Even more
fundamentally, most data is organized into arrays for analysis, where the
extensive development of numerical linear algebra tools from the early 1960s
onward provides many of the foundational mathematical underpinnings for matrix
decompositions and solution strategies used throughout this text.


Overview of Second Edition 


The integration of machine learning methods in science and engineering has 
advanced significantly in the two years since publication of the first edition. 
The field is fast-moving, with innovations coming in a diversity of application 
areas that use creative mathematical architectures for advancing the state of the 
art in data-driven modeling and control. This second edition is aimed at captur- 
ing some of the more salient and successful advancements in the field. It helps 
bring the reader to a modern understanding of what is possible using machine 
learning in science and engineering. As with the first edition, extensive online 
supplementary material can be found at the book’s website: 


http://databookuw.com 
Major changes in the second edition include the following. 


• Homework: Extensive homework has been added to every chapter, with
  additional homework and projects on the book’s website. Homework ranges
  in difficulty from introductory demonstrations and concept-building to
  advanced problems that reproduce modern research papers and may be
  the basis of course projects.


• Code: Python code has been added throughout, in parallel to existing
  MATLAB code, and both sets of codes have been streamlined considerably.
  All extended codes are available in MATLAB and Python on the book’s
  website and GitHub pages.

  - Python Code: https://github.com/dynamicslab/databook_python
  - MATLAB Code: https://github.com/dynamicslab/databook_matlab

  Wherever possible, a minimal representation of code has been presented
  in the text to improve readability. These code blocks are equivalently
  expressed in MATLAB and Python. In more advanced examples, it is often
  advantageous to use either MATLAB or Python, but not both. In such
  cases, this has been indicated and only a single code block is
  demonstrated. The full code is available at the above GitHub sites as well
  as on the book’s website. In addition, extensive codes are available in R
  online. We encourage the reader to read the book and follow along with
  code to help improve the learning process and experience.


• New chapters: Two new chapters have been added on “Reinforcement
  Learning” and “Physics-Informed Machine Learning,” which are two of
  the most exciting and rapidly growing fields of research in machine
  learning, modeling, and control.

  - Reinforcement Learning: Reinforcement learning is a third major branch
    of machine learning that is concerned with how to learn control laws
    and policies to interact with a complex environment. This is a
    critical area of research, situated at the growing intersection of
    control theory and machine learning.

  - Physics-Informed Machine Learning: The integration of physics
    concepts, constraints, and symmetries is providing exceptional
    opportunities for training machine learning algorithms that are encoded
    with knowledge of physics. This chapter features a number of recent
    innovations aimed at understanding how this can be done in principle
    and in practice.


• New sections: We have added and improved material throughout,
  including the following.

  - Chapter 1: new sections discussing condition number, connections to
    the eigendecomposition, and error bounds for SVD (singular value
    decomposition) based approximations.

  - Chapter 2: new section on the Laplace transform.

  - Chapter 6: new sections devoted to autoencoders, recurrent neural
    networks, and generative adversarial networks.


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




  - Chapter 7: addition of recent innovations to DMD (dynamic mode
    decomposition), Koopman theory, and SINDy (sparse identification
    of nonlinear dynamics).

  - Chapter 10: new section on model predictive control.

  - Chapter 12 (previously Chapter 11): new sections on using neural
    networks for time-stepping in reduced-order models, as well as
    non-intrusive methods such as DMD.

  - Chapter 13 (previously Chapter 12): new sections on decoder
    networks for interpolation in model reduction as well as randomized
    linear algebra methods for scalable reduced-order models.


• Videos: An extensive collection of video lectures is available on YouTube,
  covering nearly every topic from each section of the book. Videos may be
  found on our YouTube channels.

  - www.youtube.com/c/eigensteve
  - www.youtube.com/c/NathanKutzAMATH
  - www.youtube.com/c/PhysicsInformedMachineLearning

• Typos: We have corrected typos and mistakes throughout the second
  edition.


Online Material 


We have designed this book to make extensive use of online supplementary
material, including codes, data, videos, homework, and suggested course
syllabi. All of this material can be found at the book’s website:
http://databookuw.com

In addition to course resources, all of the code and data used in the book
are available on the book’s GitHub: https://github.com/dynamicslab/
The codes online are more extensive than those presented in the book,
including code used to generate publication-quality figures. In addition to
the Python and MATLAB used throughout the text, online code is also available
in R. Data visualization was ranked as the top-used data science method in the
Kaggle 2017 The State of Data Science and Machine Learning study, and so we
highly encourage readers to download the online codes and make full use of
these plotting commands.

We have also recorded and posted video lectures on YouTube for every section
in this book, available at www.youtube.com/c/eigensteve and
www.youtube.com/c/NathanKutzAMATH. We include supplementary videos for
students to fill in gaps in their background on scientific computing and
foundational applied mathematics. We have designed this text to be both a
reference as well as the material for several courses at various levels of
student preparation. Most chapters are also modular, and may be converted into
stand-alone boot camps, containing roughly 10 hours of materials each.


How to Use This Book 


Our intended audience includes beginning graduate students, or advanced
undergraduates, in engineering and science. As such, the machine learning
methods are introduced at a beginning level, whereas we assume students know
how to model physical systems with differential equations and simulate them
with solvers such as ode45. The diversity of topics covered thus ranges from
introductory to state-of-the-art research methods. Our aim is to provide an
integrated viewpoint and mathematical toolset for solving engineering and
science problems. Alternatively, the book can also be useful for computer
science and statistics students, who often have limited knowledge of dynamical
systems and control. Various courses can be designed from this material, and
several example syllabi may be found on the book’s website; this includes
homework, data sets, and code.

First and foremost, we want this book to be fun, inspiring, eye-opening, and 
empowering for young scientists and engineers. We have attempted to make 
everything as simple as possible, while still providing the depth and breadth 
required to be useful in research. Many of the chapter topics in this text could 
be entire books in their own right, and many of them are. However, we also 
wanted to be as comprehensive as may be reasonably expected for a field that 
is so big and moving so fast. We hope that you enjoy this book, master these 
methods, and change the world with applied data science! 
Most Common Optimization Strategies


Least-squares (discussed in Chapters 1 and 4) minimizes the sum of the squares
of the residuals between a given fitting model and data. Linear least-squares,
where the residuals are linear in the unknowns, has a closed-form solution
which can be computed by taking the derivative of the sum of squared residuals
with respect to each unknown and setting it to zero. It is commonly used in
engineering and the applied sciences for fitting polynomial functions.
Nonlinear least-squares typically requires iterative refinement based upon
approximating the nonlinear least-squares problem with a linear least-squares
problem at each iteration.
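The closed-form solution described above can be sketched for the simplest case, a straight-line fit. This is a minimal illustration rather than the book's code: the function name `fit_line` and the sample data are assumptions made for this example, and the two formulas come from setting the derivatives of the sum of squared residuals with respect to the intercept and slope to zero.

```python
def fit_line(t, y):
    """Least-squares fit of y ~ c0 + c1 * t (hypothetical helper).

    Setting the derivatives of sum((c0 + c1*t_i - y_i)**2) with respect
    to c0 and c1 to zero gives a 2x2 linear system (the normal equations),
    which is solved here in closed form.
    """
    n = len(t)
    sum_t = sum(t)
    sum_y = sum(y)
    sum_tt = sum(ti * ti for ti in t)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))
    det = n * sum_tt - sum_t * sum_t
    if det == 0:
        raise ValueError("all sample points share the same t; slope undefined")
    c1 = (n * sum_ty - sum_t * sum_y) / det
    c0 = (sum_y - c1 * sum_t) / n
    return c0, c1


# Illustrative data lying exactly on y = 1 + 2 t.
t_data = [0.0, 1.0, 2.0, 3.0]
y_data = [1.0, 3.0, 5.0, 7.0]
c0, c1 = fit_line(t_data, y_data)
```

For noisy data the same formulas return the line that minimizes the sum of squared residuals; higher-order polynomial fits lead to a larger linear system of the same kind.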


Gradient descent (discussed in Chapters 4 and 6) is the industry-leading
convex optimization method for high-dimensional systems. It minimizes residuals
by computing the gradient of a given fitting function. The iterative procedure
updates the solution by moving downhill in the residual space. The
Newton-Raphson method is a one-dimensional version of gradient descent. Since
gradient descent is often applied in high-dimensional settings, it is prone to
finding only local minima. Critical innovations for big data applications
include stochastic gradient descent and the backpropagation algorithm, which
makes the optimization amenable to computing the gradient itself.
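The downhill iteration can be sketched in a few lines. This is a minimal illustration under stated assumptions: the objective, the fixed step size of 0.1, and the iteration count are chosen for this example only, and the function names are hypothetical.

```python
def gradient_descent(grad, x0, step=0.1, iters=200):
    """Repeatedly move against the gradient with a fixed step size."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def grad_f(x):
    # Gradient of the illustrative objective
    # f(x) = (x[0] - 1)**2 + 2 * (x[1] + 2)**2, whose minimum is at (1, -2).
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]


x_min = gradient_descent(grad_f, [0.0, 0.0])
```

Because this objective is convex, the iteration reaches the global minimum; on a non-convex objective the same loop would stop at whichever local minimum lies downhill of the starting point.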


Alternating descent method (ADM) (discussed in Chapter 4) avoids computations
of the gradient by optimizing in one unknown at a time. Thus all other
unknowns are held constant while a line search (non-convex optimization) is
performed in a single variable. This variable is then updated and held constant
while another of the unknowns is updated. The procedure continues through all
unknowns, and the sweep is repeated until a desired level of accuracy is
achieved.
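A minimal sketch of this one-variable-at-a-time procedure follows. The choice of golden-section search for the line search, the search bounds, the number of sweeps, and the test objective are all assumptions made for illustration; no gradient is evaluated anywhere.

```python
import math


def golden_section(phi, lo, hi, tol=1e-8):
    """Derivative-free line search for the minimum of a unimodal phi on [lo, hi]."""
    inv_gr = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_gr * (b - a)
    d = a + inv_gr * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b = d
        else:
            a = c
        c = b - inv_gr * (b - a)
        d = a + inv_gr * (b - a)
    return 0.5 * (a + b)


def alternating_descent(f, x0, bounds, sweeps=30):
    """Minimize f by line searches in one unknown at a time."""
    x = list(x0)
    for _ in range(sweeps):
        for i in range(len(x)):
            # All unknowns except x[i] are held at their current values.
            def phi(v, i=i):
                return f(x[:i] + [v] + x[i + 1:])
            x[i] = golden_section(phi, bounds[i][0], bounds[i][1])
    return x


def f(x):
    # Illustrative coupled quadratic with its minimum at (2, -1).
    return x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 3.0 * x[0]


x_min = alternating_descent(f, [0.0, 0.0], [(-10.0, 10.0), (-10.0, 10.0)])
```

Each sweep reduces the objective, and for this example the iterates approach (2, -1); how quickly the sweeps converge depends on how strongly the unknowns are coupled.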


Augmented Lagrange method (ALM) (discussed in Chapters 3 and 8) is a class
of algorithms for solving constrained optimization problems. These methods are
similar to penalty methods in that they replace a constrained optimization
problem by a series of unconstrained problems and add a penalty term to the
objective which helps enforce the desired constraint. ALM adds another term
designed to mimic a Lagrange multiplier. The augmented Lagrangian is not the
same as the method of Lagrange multipliers.
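The interplay of the penalty term and the multiplier-like term can be seen on a toy problem: minimize x^2 + y^2 subject to x + y = 1. The sketch below is an illustration under assumptions, not the book's implementation: the penalty weight `mu`, the iteration count, and the function name are hypothetical, and the inner unconstrained minimization is done in closed form, which is only possible because this example is quadratic.

```python
def augmented_lagrangian_demo(mu=10.0, iters=20):
    """Minimize x**2 + y**2 subject to x + y = 1.

    Augmented Lagrangian:
        L(x, y) = x**2 + y**2 + lam * c + (mu / 2) * c**2,  c = x + y - 1.
    """
    lam = 0.0
    x = y = 0.0
    for _ in range(iters):
        # Unconstrained step: setting dL/dx = dL/dy = 0 gives x = y = v
        # with 2*v + lam + mu*(2*v - 1) = 0.
        v = (mu - lam) / (2.0 + 2.0 * mu)
        x = y = v
        # Multiplier update driven by the remaining constraint violation.
        lam += mu * (x + y - 1.0)
    return x, y, lam


x_opt, y_opt, lam_opt = augmented_lagrangian_demo()
```

The iterates approach x = y = 0.5 with multiplier -1, and the constraint is satisfied in the limit without the penalty weight having to grow, which is the practical advantage over a pure penalty method.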


Linear program and simplex method are the workhorse algorithms for convex
optimization. A linear program has an objective function which is linear
in the unknown, and the constraints consist of linear inequalities and
equalities. By computing its feasible region, which is a convex polytope, the
linear programming algorithm finds a point in the polytope where this function
has the smallest (or largest) value if such a point exists. The simplex method
is a specific iterative technique for linear programs which aims to take a given
basic feasible solution to another basic feasible solution for which the
objective function is smaller, thus producing an iterative procedure for
optimizing.


Most Common Equations and Symbols 


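Linear Algebra

Linear System of Equations

$$\mathbf{A}\mathbf{x} = \mathbf{b}.$$

The matrix $\mathbf{A} \in \mathbb{R}^{p \times n}$ and vector
$\mathbf{b} \in \mathbb{R}^{p}$ are generally known, and the vector
$\mathbf{x} \in \mathbb{R}^{n}$ is unknown.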


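Eigenvalue Equation

$$\mathbf{A}\mathbf{T} = \mathbf{T}\boldsymbol{\Lambda}.$$

The columns $\boldsymbol{\xi}_k$ of the matrix $\mathbf{T}$ are the
eigenvectors of $\mathbf{A} \in \mathbb{C}^{n \times n}$ corresponding to the
eigenvalue $\lambda_k$:
$\mathbf{A}\boldsymbol{\xi}_k = \lambda_k \boldsymbol{\xi}_k$. The matrix
$\boldsymbol{\Lambda}$ is a diagonal matrix containing these eigenvalues, in
the simple case with $n$ distinct eigenvalues.
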
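Change of Coordinates

$$\mathbf{x} = \boldsymbol{\Psi}\mathbf{a}.$$

The vector $\mathbf{x} \in \mathbb{R}^{n}$ may be written as
$\mathbf{a} \in \mathbb{R}^{n}$ in the coordinate system given by the columns
of $\boldsymbol{\Psi} \in \mathbb{R}^{n \times n}$.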


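Measurement Equation

$$\mathbf{y} = \mathbf{C}\mathbf{x}.$$

The vector $\mathbf{y} \in \mathbb{R}^{p}$ is a measurement of the state
$\mathbf{x} \in \mathbb{R}^{n}$ by the measurement matrix
$\mathbf{C} \in \mathbb{R}^{p \times n}$.
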
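Singular Value Decomposition

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{*} \approx \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^{*}.$$

The matrix $\mathbf{X} \in \mathbb{C}^{n \times m}$ may be decomposed into the
product of three matrices $\mathbf{U} \in \mathbb{C}^{n \times n}$,
$\boldsymbol{\Sigma} \in \mathbb{C}^{n \times m}$, and
$\mathbf{V} \in \mathbb{C}^{m \times m}$. The matrices $\mathbf{U}$ and
$\mathbf{V}$ are unitary, so that
$\mathbf{U}\mathbf{U}^{*} = \mathbf{U}^{*}\mathbf{U} = \mathbf{I}_{n \times n}$
and
$\mathbf{V}\mathbf{V}^{*} = \mathbf{V}^{*}\mathbf{V} = \mathbf{I}_{m \times m}$,
where $*$ denotes complex conjugate transpose. The columns of $\mathbf{U}$
(respectively $\mathbf{V}$) are orthogonal, called left (respectively right)
singular vectors. The matrix $\boldsymbol{\Sigma}$ contains decreasing,
non-negative diagonal entries called singular values.

Often, $\mathbf{X}$ is approximated with a low-rank matrix
$\tilde{\mathbf{X}} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^{*}$,
where $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ contain the first
$r \ll n$ columns of $\mathbf{U}$ and $\mathbf{V}$, respectively, and
$\tilde{\boldsymbol{\Sigma}}$ contains the first $r \times r$ block of
$\boldsymbol{\Sigma}$. The matrix $\tilde{\mathbf{U}}$ is often denoted
$\boldsymbol{\Psi}$ in the context of spatial modes, reduced-order models, and
sensor placement.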
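The truncation described above can be sketched with NumPy's `numpy.linalg.svd`. The synthetic data matrix, its sizes, the noise level, and the variable names are assumptions chosen for illustration; they are not taken from the book's code.

```python
import numpy as np

# Synthetic data: a rank-3 matrix plus small noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
X += 1e-3 * rng.standard_normal(X.shape)

# Economy SVD: U is 50x20, S holds the 20 singular values, Vh is V* (20x20).
U, S, Vh = np.linalg.svd(X, full_matrices=False)

# Rank-r truncation: keep the first r columns of U and V and the
# leading r x r block of Sigma.
r = 3
X_tilde = U[:, :r] @ np.diag(S[:r]) @ Vh[:r, :]

# Relative error of the approximation in the matrix 2-norm.
rel_err = np.linalg.norm(X - X_tilde, 2) / np.linalg.norm(X, 2)
```

In the 2-norm, the error of the rank-r truncation equals the first discarded singular value, so inspecting the decay of `S` is a common way to choose `r`.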


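Regression and Optimization

Over-determined and Under-determined Optimization for Linear Systems

$$\underset{\mathbf{x}}{\operatorname{argmin}} \left( \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 + \lambda g(\mathbf{x}) \right) \quad \text{or}$$

$$\underset{\mathbf{x}}{\operatorname{argmin}} \; g(\mathbf{x}) \quad \text{subject to} \quad \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 \leq \epsilon.$$

Here $g(\mathbf{x})$ is a regression penalty (with penalty parameter $\lambda$
for over-determined systems). For over- and under-determined linear systems of
equations, which result in either no solutions or an infinite number of
solutions of $\mathbf{A}\mathbf{x} = \mathbf{b}$, a choice of constraint or
penalty, which is also known as regularization, must be made in order to
produce a solution.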
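To see why a penalty is needed in the under-determined case, consider a single equation in three unknowns, which has infinitely many solutions. The sketch below picks the one with the smallest 2-norm, that is, it takes the penalty to be the 2-norm of x with the equation enforced exactly. The numbers and the helper name `min_norm_solution` are assumptions for illustration.

```python
import math


def min_norm_solution(a, b):
    """Smallest 2-norm x satisfying the single equation dot(a, x) = b.

    For one equation the minimizer is x = a * b / dot(a, a); this is what
    the pseudo-inverse returns for a one-row matrix.
    """
    aa = sum(ai * ai for ai in a)
    if aa == 0:
        raise ValueError("a must be non-zero")
    return [ai * b / aa for ai in a]


def norm2(x):
    return math.sqrt(sum(xi * xi for xi in x))


a = [1.0, 2.0, 2.0]
b = 9.0
x_reg = min_norm_solution(a, b)   # regularized choice among all solutions
x_other = [9.0, 0.0, 0.0]         # another exact solution, with larger norm
```

Both vectors satisfy the equation exactly; the penalty is what singles out `x_reg`. A different choice of penalty, such as one that promotes sparsity, would select a different solution from the same infinite family.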


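Over-determined and Under-determined Optimization for Nonlinear Systems

$$\underset{\mathbf{x}}{\operatorname{argmin}} \left( f(\mathbf{A}, \mathbf{x}, \mathbf{b}) + \lambda g(\mathbf{x}) \right) \quad \text{or}$$

$$\underset{\mathbf{x}}{\operatorname{argmin}} \; g(\mathbf{x}) \quad \text{subject to} \quad f(\mathbf{A}, \mathbf{x}, \mathbf{b}) \leq \epsilon.$$

This generalizes the linear system to a nonlinear system $f(\cdot)$ with
regularization $g(\cdot)$. These over- and under-determined systems are often
solved using gradient descent algorithms.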


Compositional Optimization for Neural Networks

$$\underset{\mathbf{A}_j}{\operatorname{argmin}} \left( f_M(\mathbf{A}_M, \cdots f_2(\mathbf{A}_2, f_1(\mathbf{A}_1, \mathbf{x})) \cdots ) + \lambda g(\mathbf{A}_j) \right).$$

Each $\mathbf{A}_k$ denotes the weights connecting the neural network from the
$k$th to the $(k+1)$th layer. It is typically a massively under-determined
system which is regularized by $g(\mathbf{A}_j)$. Composition and
regularization are critical for generating expressive representations and
preventing overfitting. The full network is often denoted
$\mathbf{f}_{\boldsymbol{\theta}}$.


Dynamical Systems and Reduced-Order Models

Nonlinear Ordinary Differential Equation (Dynamical System)

$$\frac{d}{dt}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t), t; \boldsymbol{\beta}).$$

The vector $\mathbf{x}(t) \in \mathbb{R}^{n}$ is the state of the system
evolving in time $t$, $\boldsymbol{\beta}$ are parameters, and $\mathbf{f}$ is
the vector field. Generally, $\mathbf{f}$ is Lipschitz continuous to guarantee
existence and uniqueness of solutions.


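Linear Input-Output System

$$\frac{d}{dt}\mathbf{x} = \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u},$$

$$\mathbf{y} = \mathbf{C}\mathbf{x} + \mathbf{D}\mathbf{u}.$$

The state of the system is $\mathbf{x} \in \mathbb{R}^{n}$, the inputs
(actuators) are $\mathbf{u} \in \mathbb{R}^{q}$, and the outputs (sensors) are
$\mathbf{y} \in \mathbb{R}^{p}$. The matrices $\mathbf{A}$, $\mathbf{B}$,
$\mathbf{C}$, and $\mathbf{D}$ define the dynamics, the effect of actuation,
the sensing strategy, and the effect of actuation feed-through, respectively.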


Nonlinear Map (Discrete-Time Dynamical System)

$$\mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k).$$

The state of the system at the $k$th iteration is
$\mathbf{x}_k \in \mathbb{R}^{n}$, and $\mathbf{F}$ is a possibly nonlinear
mapping. Often, this map defines an iteration forward in time, so that
$\mathbf{x}_k = \mathbf{x}(k\Delta t)$; in this case the flow map is denoted
$\mathbf{F}_{\Delta t}$.
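Iterating such a map takes only a short loop. The sketch below uses the logistic map as a stand-in for F; that choice, the parameter r = 2.5, the initial state, and the function names are assumptions made for illustration.

```python
def iterate_map(F, x0, steps):
    """Return the trajectory [x_0, x_1, ..., x_steps] of x_{k+1} = F(x_k)."""
    trajectory = [x0]
    for _ in range(steps):
        trajectory.append(F(trajectory[-1]))
    return trajectory


def logistic(x, r=2.5):
    # Scalar nonlinear map used as an example of F.
    return r * x * (1.0 - x)


trajectory = iterate_map(logistic, 0.2, 100)
```

For r = 2.5 the trajectory settles onto the stable fixed point 1 - 1/r = 0.6; for larger values of r the same loop produces periodic and eventually chaotic behavior.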


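Koopman Operator Equation (Discrete-Time)

$$\mathcal{K}_t g = g \circ \mathbf{F}_t \quad \Longrightarrow \quad \mathcal{K}_t \varphi = \lambda \varphi.$$

The linear Koopman operator $\mathcal{K}_t$ advances measurement functions of
the state $g(\mathbf{x})$ with the flow $\mathbf{F}_t$. The eigenvalues and
eigenvectors of $\mathcal{K}_t$ are $\lambda$ and $\varphi(\mathbf{x})$,
respectively. The operator $\mathcal{K}_t$ operates on a Hilbert space of
measurements.
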
Nonlinear Partial Differential Equation (PDE)

$$\mathbf{u}_t = \mathbf{N}(\mathbf{u}, \mathbf{u}_x, \mathbf{u}_{xx}, \cdots, x, t; \boldsymbol{\beta}).$$

The state of the PDE is $\mathbf{u}$, the nonlinear evolution operator is
$\mathbf{N}$, subscripts denote partial differentiation, and $x$ and $t$ are
the spatial and temporal variables, respectively. The PDE is parameterized by
values in $\boldsymbol{\beta}$. The state $\mathbf{u}$ of the PDE may be a
continuous function $u(x, t)$, or it may be discretized at several spatial
locations,
$\mathbf{u}(t) = [u(x_1, t) \;\; u(x_2, t) \;\; \cdots \;\; u(x_n, t)]^{T} \in \mathbb{R}^{n}$.


Galerkin Expansion

The continuous Galerkin expansion is

$$u(x, t) \approx \sum_{k=1}^{r} a_k(t) \psi_k(x).$$

The functions $a_k(t)$ are temporal coefficients that capture the time
dynamics, and $\psi_k(x)$ are spatial modes. For a high-dimensional discretized
state, the Galerkin expansion becomes
$\mathbf{u}(t) \approx \sum_{k=1}^{r} a_k(t) \boldsymbol{\psi}_k$. The spatial
modes $\boldsymbol{\psi}_k \in \mathbb{R}^{n}$ may be the columns of
$\boldsymbol{\Psi} = \tilde{\mathbf{U}}$.

Complete Symbols 


Dimensions

$K$   Number of non-zero entries in a $K$-sparse vector $\mathbf{s}$
$m$   Number of data snapshots (i.e., columns of $\mathbf{X}$)
$n$   Dimension of the state, $\mathbf{x} \in \mathbb{R}^{n}$
$p$   Dimension of the measurement or output variable, $\mathbf{y} \in \mathbb{R}^{p}$
$q$   Dimension of the input variable, $\mathbf{u} \in \mathbb{R}^{q}$
$r$   Rank of truncated SVD, or other low-rank approximation


Scalars

$s$   Frequency in Laplace domain
$t$   Time
Learning rate in gradient descent
$\Delta t$   Time-step
$x$   Spatial variable
$\Delta x$   Spatial step
$\sigma$   Singular value
$\lambda$   Eigenvalue
$\lambda$   Sparsity parameter for sparse optimization (Section 7.3)
$\lambda$   Lagrange multiplier (Sections and
$\tau$   Threshold

Vectors

Vector of mode amplitudes of $\mathbf{x}$ in basis $\boldsymbol{\Psi}$, $\mathbf{a} \in \mathbb{R}^{r}$
Action of reinforcement learning agent (Chapter 11)
Vector of measurements in linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$
Vector of DMD mode amplitudes
Vector containing potential function for PDE-FIND
Residual error vector
Sparse vector, $\mathbf{s} \in \mathbb{R}^{n}$ (Chapter 3)
State of the environment in reinforcement learning (Chapter 11)
Control variable (Chapters 8, 9, and 10)
PDE state vector (Chapters 12 and 13)
Exogenous inputs
$\mathbf{w}_d$   Disturbances to system
$\mathbf{w}_n$   Measurement noise
$\mathbf{w}_r$   Reference to track
State of a system, $\mathbf{x} \in \mathbb{R}^{n}$
Snapshot of data at time $t_k$
Data sample $j \in Z := \{1, 2, \ldots, m\}$ (Chapters 5 and 6)
Reduced state, $\tilde{\mathbf{x}} \in \mathbb{R}^{r}$, so that $\mathbf{x} \approx \tilde{\mathbf{U}}\tilde{\mathbf{x}}$
Estimated state of a system
Vector of measurements, $\mathbf{y} \in \mathbb{R}^{p}$
Data label $j \in Z := \{1, 2, \ldots, m\}$ (Chapters 5 and 6)
Estimated output measurement
Transformed state, $\mathbf{x} = \mathbf{T}\mathbf{z}$ (Chapters 8 and 9)
Error vector
Bifurcation parameters
Eigenvector of Koopman operator (Sections 7.4 and 7.5)
Sparse vector of coefficients (Section 7.3)
Neural network parameters
DMD mode
POD mode
Vector of PDE measurements for PDE-FIND


Matrices

$\mathbf{A}$   Matrix for system of equations or dynamics
$\tilde{\mathbf{A}}$   Reduced dynamics on $r$-dimensional POD subspace
$\mathbf{A}_{\mathbf{X}}$   Matrix representation of linear dynamics on the state $\mathbf{x}$
$\mathbf{A}_{\mathbf{Y}}$   Matrix representation of linear dynamics on the observables $\mathbf{y}$
$(\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D})$   Matrices for continuous-time state-space system
Matrices for discrete-time state-space system
Matrices for state-space system in new coordinates $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$
Matrices for reduced state-space system with rank $r$
Actuation input matrix
Linear measurement matrix from state to measurements
Controllability matrix
Discrete Fourier transform
Matrix representation of linear dynamics on the states and inputs $[\mathbf{x}^{T} \; \mathbf{u}^{T}]^{T}$
Hankel matrix
Time-shifted Hankel matrix
Identity matrix
Matrix form of Koopman operator (Chapter 7)
Closed-loop control gain (Chapter 8)
Kalman filter estimator gain
LQR control gain
Low-rank portion of matrix $\mathbf{X}$ (Chapter 3)
Observability matrix
Unitary matrix that acts on columns of $\mathbf{X}$
Weight matrix for state penalty in LQR (Section 8.4)
Orthogonal matrix from QR factorization
Weight matrix for actuation penalty in LQR (Section 8.4)
Upper triangular matrix from QR factorization
Sparse portion of matrix $\mathbf{X}$ (Chapter 3)
Matrix of eigenvectors (Chapter 8)
Change of coordinates (Chapters 8 and 9)
Left singular vectors of $\mathbf{X}$, $\mathbf{U} \in \mathbb{R}^{n \times n}$
Left singular vectors of economy SVD of $\mathbf{X}$, $\hat{\mathbf{U}} \in \mathbb{R}^{n \times m}$
Left singular vectors (POD modes) of truncated SVD of $\mathbf{X}$, $\tilde{\mathbf{U}} \in \mathbb{R}^{n \times r}$
Right singular vectors of $\mathbf{X}$, $\mathbf{V} \in \mathbb{R}^{m \times m}$
Right singular vectors of truncated SVD of $\mathbf{X}$, $\tilde{\mathbf{V}} \in \mathbb{R}^{m \times r}$
Matrix of singular values of $\mathbf{X}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times m}$
Matrix of singular values of economy SVD of $\mathbf{X}$, $\hat{\boldsymbol{\Sigma}} \in \mathbb{R}^{m \times m}$
Matrix of singular values of truncated SVD of $\mathbf{X}$, $\tilde{\boldsymbol{\Sigma}} \in \mathbb{R}^{r \times r}$
Eigenvectors of $\mathbf{A}$

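$\mathbf{W}_c$   Controllability Gramian
$\mathbf{W}_o$   Observability Gramian
$\mathbf{X}$   Data matrix, $\mathbf{X} \in \mathbb{R}^{n \times m}$
$\mathbf{X}'$   Time-shifted data matrix, $\mathbf{X}' \in \mathbb{R}^{n \times m}$
$\mathbf{Y}$   Projection of $\mathbf{X}$ matrix onto orthogonal basis in randomized SVD (Section
$\mathbf{Y}$   Data matrix of observables, $\mathbf{Y} = \mathbf{g}(\mathbf{X})$, $\mathbf{Y} \in \mathbb{R}^{p \times m}$
$\mathbf{Y}'$   Shifted data matrix of observables, $\mathbf{Y}' = \mathbf{g}(\mathbf{X}')$, $\mathbf{Y}' \in \mathbb{R}^{p \times m}$ (Chapter 7)
$\mathbf{Z}$   Sketch matrix for randomized SVD, $\mathbf{Z} \in \mathbb{R}^{n \times r}$ (Section 1.8)
$\boldsymbol{\Theta}$   Measurement matrix times sparsifying basis, $\boldsymbol{\Theta} = \mathbf{C}\boldsymbol{\Psi}$ (Chapter 3)
$\boldsymbol{\Theta}$   Matrix of candidate functions for SINDy (Section 7.3)
$\boldsymbol{\Gamma}$   Matrix of derivatives of candidate functions for SINDy (Section 7.3)
$\boldsymbol{\Xi}$   Matrix of coefficients of candidate functions for SINDy
$\boldsymbol{\Xi}$   Matrix of nonlinear snapshots for DEIM (Section 13.5)
$\boldsymbol{\Lambda}$   Diagonal matrix of eigenvalues
$\boldsymbol{\Upsilon}$   Input snapshot matrix, $\boldsymbol{\Upsilon} \in \mathbb{R}^{q \times m}$
$\boldsymbol{\Phi}$   Matrix of DMD modes, $\boldsymbol{\Phi} \triangleq \mathbf{X}'\mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{W}$
$\boldsymbol{\Psi}$   Orthonormal basis (e.g., Fourier or POD modes)


Tensors

$(\mathcal{A}, \mathcal{B}, \mathcal{M})$   $N$-way array tensors of size $I_1 \times I_2 \times \cdots \times I_N$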


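Norms

$\|\cdot\|_0$   $\ell_0$ pseudo-norm of a vector $\mathbf{x}$; the number of non-zero elements in $\mathbf{x}$
$\|\cdot\|_1$   $\ell_1$-norm of a vector $\mathbf{x}$ given by $\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$
$\|\cdot\|_2$   $\ell_2$-norm of a vector $\mathbf{x}$ given by $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$
$\|\cdot\|_2$   2-norm of a matrix $\mathbf{X}$ given by $\|\mathbf{X}\|_2 = \max_{\mathbf{v} \neq \mathbf{0}} \|\mathbf{X}\mathbf{v}\|_2 / \|\mathbf{v}\|_2$
$\|\cdot\|_F$   Frobenius norm of a matrix $\mathbf{X}$ given by $\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} |X_{ij}|^2}$
$\|\cdot\|_*$   Nuclear norm of a matrix $\mathbf{X}$ given by $\|\mathbf{X}\|_* = \operatorname{trace}\left(\sqrt{\mathbf{X}^{*}\mathbf{X}}\right) = \sum_{i=1}^{m} \sigma_i$ (for $m \leq n$)
$\langle \cdot, \cdot \rangle$   Inner product; for functions, $\langle f(x), g(x) \rangle = \int_{-\infty}^{\infty} f(x) g^{*}(x) \, dx$
$\langle \cdot, \cdot \rangle$   Inner product; for vectors, $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^{*}\mathbf{v}$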


Operators, Functions, and Maps

$\mathcal{F}$   Fourier transform
$\mathbf{F}$   Discrete-time dynamical system map
$\mathbf{F}_t$   Discrete-time flow map of dynamical system through time $t$
$\mathbf{f}_{\boldsymbol{\theta}}$   Neural network (Chapter 6)
$\mathbf{f}$   Continuous-time dynamical system (Chapter 7)
Gabor transform
Transfer function from inputs to outputs (Chapter 8)
Scalar measurement function on $\mathbf{x}$
Vector-valued measurement functions on $\mathbf{x}$
Cost function for control
Loss function for support vector machines (Chapter 5)
Koopman operator (continuous-time)
Koopman operator associated with time-$t$ flow map
Laplace transform
Loop transfer function (Chapter 8)
Linear partial differential equation (Chapters 12 and
Nonlinear partial differential equation
Order of magnitude
Quality function (Chapter 11)
Real part
Sensitivity function (Chapter)
Complementary sensitivity function (Chapter 8)
Value function (Chapter 11)
Wavelet transform
Incoherence between measurement matrix $\mathbf{C}$ and basis $\boldsymbol{\Psi}$
Condition number
Policy function for agent in reinforcement learning (Chapter 11)
Koopman eigenfunction
Gradient operator
Convolution operator


Most Common Acronyms 


CNN Convolutional neural network 
DL Deep learning 
DMD Dynamic mode decomposition 
FFT Fast Fourier transform 
ODE Ordinary differential equation 
PCA Principal component analysis 
PDE Partial differential equation 
POD Proper orthogonal decomposition 
RL Reinforcement learning 
ROM Reduced-order model 
SVD Singular value decomposition 
Other Acronyms
ADM Alternating directions method 
AIC Akaike information criterion 
ALM Augmented Lagrange multiplier 
ANN Artificial neural network 
ARMA Autoregressive moving average 
ARMAX Autoregressive moving average with exogenous input 
BIC Bayesian information criterion 
BPOD Balanced proper orthogonal decomposition 
CCA Canonical correlation analysis 
CFD Computational fluid dynamics 
CoSaMP Compressive sampling matching pursuit 
CWT Continuous wavelet transform 
DEIM Discrete empirical interpolation method 
DCT Discrete cosine transform 
DFT Discrete Fourier transform 
DMDc Dynamic mode decomposition with control
DNS Direct numerical simulation 
DP Dynamic programming 
DQN Deep Q network 
DRL Deep reinforcement learning 
DWT Discrete wavelet transform 
ECOG Electrocorticography
eDMD Extended DMD 
EIM Empirical interpolation method 
EM Expectation maximization 
EOF Empirical orthogonal functions 
ERA Eigensystem realization algorithm 
ESC Extremum-seeking control 
GMM Gaussian mixture model 
HAVOK Hankel alternative view of Koopman 
HER Hindsight experience replay 
HJB Hamilton-Jacobi-Bellman equation 
JL Johnson-Lindenstrauss 
KL Kullback-Leibler
ICA Independent component analysis
KLT Karhunen-Loève transform
LAD Least absolute deviations 
LASSO Least absolute shrinkage and selection operator 
LDA Linear discriminant analysis 
LQE Linear-quadratic estimator
LQG Linear-quadratic Gaussian controller
LQR Linear-quadratic regulator
LTI Linear time-invariant system
MDP Markov decision process
MIMO Multiple-input, multiple-output
MLC Machine learning control
MPE Missing point estimation
mrDMD Multi-resolution dynamic mode decomposition
NARMAX Nonlinear autoregressive model with exogenous inputs
NLS Nonlinear Schrödinger equation
OKID Observer Kalman filter identification
PBH Popov-Belevitch-Hautus test
PCP Principal component pursuit
PDE-FIND Partial differential equation functional identification of nonlinear dynamics
PDF Probability density function
PID Proportional-integral-derivative control
PINN Physics-informed neural network
PIV Particle image velocimetry
RIP Restricted isometry property
rSVD Randomized SVD
RKHS Reproducing kernel Hilbert space
RNN Recurrent neural network
RPCA Robust principal component analysis
SGD Stochastic gradient descent
SINDy Sparse identification of nonlinear dynamics
SINDYc SINDy with control
SISO Single-input, single-output
SRC Sparse representation for classification
SSA Singular spectrum analysis
STFT Short-time Fourier transform
STLS Sequential thresholded least-squares
SVM Support vector machine
TICA Time-lagged independent component analysis
VAC Variational approach of conformation dynamics
Part I


Dimensionality Reduction and Transforms


Chapter 1


Singular Value Decomposition (SVD)


The singular value decomposition (SVD) is among the most important matrix 
factorizations of the computational era, providing a foundation for nearly all of 
the data methods in this book. The SVD provides a numerically stable matrix 
decomposition that can be used for a variety of purposes and is guaranteed to 
exist. We will use the SVD to obtain optimal low-rank approximations to ma- 
trices and to perform pseudo-inverses of non-square matrices to find a solution 
to the system of equations Ax = b. The SVD will also be used as the underly- 
ing algorithm of principal component analysis (PCA), where high-dimensional 
data is decomposed into its most statistically descriptive factors. SVD/PCA has 
been applied to a wide variety of problems in science and engineering. 

In a sense, the SVD generalizes the concept of the fast Fourier transform 
(FFT), which will be the subject of the next chapter. Many engineering texts 
begin with the FFT, as it is the basis of many classical analytical and numerical 
results. However, the FFT works in idealized settings, and the SVD is a more 
generic data-driven technique. Because this book is focused on data, we begin 
with the SVD, which may be thought of as providing a basis that is tailored to 
the specific data, as opposed to the FFT, which provides a generic basis. 

In many domains, complex systems will generate data that is naturally ar- 
ranged in large matrices, or more generally in arrays. For example, a time series 
of data from an experiment or a simulation may be arranged in a matrix, with 
each column containing all of the measurements at a given time. If the data at 
each instant in time is multi-dimensional, as in a high-resolution simulation of 
the weather in three spatial dimensions, it is possible to reshape or flatten this 
data into a high-dimensional column vector, forming the columns of a large 
matrix. Similarly, the pixel values in a grayscale image may be stored in a ma- 
trix, or these images may be reshaped into large column vectors in a matrix 
to represent the frames of a movie. Remarkably, the data generated by these 
systems is typically low-rank, meaning that there are a few dominant patterns 
that explain the high-dimensional data. The SVD is a numerically robust and 
efficient method of extracting these patterns from data. 




1.1 Overview 


Here we introduce the singular value decomposition (SVD) and develop an
intuition for how to apply the SVD by demonstrating its use on a number of
motivating examples. The SVD will provide a foundation for many other
techniques developed in this book, including classification methods in
Chapter 5, the dynamic mode decomposition (DMD) in Chapter 7, and the proper
orthogonal decomposition (POD) in Chapter 12. Detailed mathematical
properties are discussed in the following sections.

High dimensionality is a common challenge in processing data from com- 
plex systems. These systems may involve large measured data sets including 
audio, image, or video data. The data may also be generated from a physical 
system, such as neural recordings from a brain, or fluid velocity measurements 
from a simulation or experiment. In many naturally occurring systems, it is
observed that data exhibit dominant patterns, which may be characterized by a
low-dimensional attractor or manifold [334, 335].

As an example, consider images, which typically contain a large number of 
measurements (pixels), and are therefore elements of a high-dimensional vector 
space. However, most images are highly compressible, meaning that the rele- 
vant information may be represented in a much lower-dimensional subspace. 
The compressibility of images will be discussed in depth throughout this book. 
Complex fluid systems, such as the Earth’s atmosphere or the turbulent wake 
behind a vehicle, also provide compelling examples of the low-dimensional 
structure underlying a high-dimensional state space. Although high-fidelity 
fluid simulations typically require at least millions or billions of degrees of free- 
dom, there are often dominant coherent structures in the flow, such as periodic 
vortex shedding behind vehicles or hurricanes in the weather. 

The SVD provides a systematic way to determine a low-dimensional ap- 
proximation to high-dimensional data in terms of dominant patterns. This tech- 
nique is data-driven in that patterns are discovered purely from data, without 
the addition of expert knowledge or intuition. The SVD is numerically stable 
and provides a hierarchical representation of the data in terms of a new coor- 
dinate system defined by dominant correlations within the data. Moreover, the 
SVD is guaranteed to exist for any matrix, unlike the eigendecomposition. 

The SVD has many powerful applications beyond dimensionality reduction 
of high-dimensional data. It is used to compute the pseudo-inverse of non- 
square matrices, providing solutions to under-determined or over-determined 
matrix equations, Ax = b. We will also use the SVD to de-noise data sets. The 
SVD is likewise important to characterize the input and output geometry of a 
linear map between vector spaces. These applications will all be explored in 
this chapter, providing an intuition for matrices and high-dimensional data. 
Definition of the SVD


Generally, we are interested in analyzing a large data set $\mathbf{X} \in \mathbb{C}^{n\times m}$:

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m \end{bmatrix}. \tag{1.1}$$

The columns $\mathbf{x}_k \in \mathbb{C}^{n}$ may be measurements from simulations or experiments.
For example, columns may represent images that have been reshaped into
column vectors with as many elements as pixels in the image. The column vectors
may also represent the state of a physical system that is evolving in time, such
as the fluid velocity at a set of discrete points, a set of neural measurements, or
the state of a weather simulation with one square kilometer resolution.

The index $k$ is a label indicating the $k$th distinct set of measurements. For
many of the examples in this book, $\mathbf{X}$ will consist of a time series of data, and
$\mathbf{x}_k = \mathbf{x}(k\Delta t)$. Often the state dimension $n$ is very large, on the order of millions
or billions of degrees of freedom. The columns are often called snapshots, and
$m$ is the number of snapshots in $\mathbf{X}$. For many systems $n \gg m$, resulting in a
tall-skinny matrix, as opposed to a short-fat matrix when $n \ll m$.
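As a small illustration of how snapshots are arranged into the data matrix, the following sketch flattens a handful of two-dimensional fields into the columns of $\mathbf{X}$. The field size (3 x 5), the number of snapshots (4), and the random data are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical data: m = 4 snapshots of a 3 x 5 spatial field (random values).
rng = np.random.default_rng(0)
snapshots = [rng.standard_normal((3, 5)) for _ in range(4)]

# Flatten each snapshot into a column vector and stack the columns side by side.
X = np.column_stack([s.reshape(-1) for s in snapshots])

# n is the state dimension, m is the number of snapshots (tall-skinny here).
n, m = X.shape

# Any column can be reshaped back into the original field.
field_2 = X[:, 2].reshape(3, 5)
```

The only design choice is that the same flattening order is used for every snapshot, so that each row of $\mathbf{X}$ always refers to the same spatial location.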

The SVD is a unique matrix decomposition that exists for every
complex-valued matrix $\mathbf{X} \in \mathbb{C}^{n\times m}$:

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*, \tag{1.2}$$

where $\mathbf{U} \in \mathbb{C}^{n\times n}$ and $\mathbf{V} \in \mathbb{C}^{m\times m}$ are unitary matrices¹ with orthonormal columns,
and $\boldsymbol{\Sigma} \in \mathbb{R}^{n\times m}$ is a matrix with real, non-negative entries on the diagonal and
zeros off the diagonal. Here $*$ denotes the complex conjugate transpose.² As we
will discover throughout this chapter, the condition that $\mathbf{U}$ and $\mathbf{V}$ are unitary
is used extensively.

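When $n \geq m$, the matrix $\boldsymbol{\Sigma}$ has at most $m$ non-zero elements on the
diagonal, and may be written as

$$\boldsymbol{\Sigma} = \begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix}.$$

Therefore, it is possible to exactly represent $\mathbf{X}$ using the economy SVD:

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^* = \begin{bmatrix} \hat{\mathbf{U}} & \hat{\mathbf{U}}^{\perp} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix} \mathbf{V}^* = \hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\mathbf{V}^*. \tag{1.3}$$

The full SVD and economy SVD are shown in Fig. 1.1. The columns of $\hat{\mathbf{U}}^{\perp}$ span
a vector space that is complementary and orthogonal to that spanned by $\hat{\mathbf{U}}$.

Figure 1.1: Schematic of matrices in the full and economy SVD.

¹ A square matrix $\mathbf{U}$ is unitary if $\mathbf{U}\mathbf{U}^* = \mathbf{U}^*\mathbf{U} = \mathbf{I}$.
² For real-valued matrices, this is the same as the regular transpose, $\mathbf{X}^* = \mathbf{X}^T$.

The columns of $\mathbf{U}$ are called left singular vectors of $\mathbf{X}$ and the columns of $\mathbf{V}$ are right
singular vectors. The diagonal elements of $\hat{\boldsymbol{\Sigma}} \in \mathbb{C}^{m\times m}$ are called singular values
and they are ordered from largest to smallest. The rank of $\mathbf{X}$ is equal to the
number of non-zero singular values. We will show in Section 1.2 that the SVD
can also be used to obtain an optimal rank-$r$ approximation of $\mathbf{X}$ for $r < m$.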
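The definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small random complex matrix (the sizes 6 x 3 and the rank tolerance are arbitrary choices); it verifies that the factors are unitary, that both the full and economy SVD reconstruct the matrix, and that the numerical rank equals the number of singular values above a tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

# Full SVD: U is n x n and V is m x m. NumPy returns V* (called Vh here)
# and the singular values as a 1-D array s, sorted in descending order.
U, s, Vh = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros((n, m))
Sigma[:m, :m] = np.diag(s)

# Economy SVD: only the first m columns of U are returned.
Uhat, shat, Vhhat = np.linalg.svd(X, full_matrices=False)

unitary_U = np.allclose(U.conj().T @ U, np.eye(n))
unitary_V = np.allclose(Vh @ Vh.conj().T, np.eye(m))
full_ok = np.allclose(U @ Sigma @ Vh, X)
econ_ok = np.allclose(Uhat @ np.diag(shat) @ Vhhat, X)

# Numerical rank: number of singular values above a small relative tolerance
# (the tolerance 1e-12 is an assumption for this sketch).
rank = int(np.sum(s > 1e-12 * s[0]))
```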


Computing the SVD 


The SVD is a cornerstone of computational science and engineering, and the
numerical implementation of the SVD is both important and mathematically
enlightening. That said, most standard numerical implementations are mature
and a simple interface exists in many modern computer languages, allowing
us to abstract away the details underlying the SVD computation. For most
purposes, we simply use the SVD as a part of a larger effort, and we take for
granted the existence of efficient and stable numerical algorithms. Numerically,
the SVD may be computed by first reducing the matrix $\mathbf{X}$ to a bidiagonal
matrix and then using an iterative algorithm to compute the SVD of the
bidiagonal matrix. For matrices with high aspect ratio (i.e., $n \gg m$), the first
step may be achieved by first computing a QR factorization to reduce $\mathbf{X}$ to an
upper triangular matrix, followed by Householder reflections to reduce this
upper triangular matrix into a bidiagonal form. The second step may be
performed using a modified QR algorithm developed by Golub & Kahan [285].
Details of the QR factorization are beyond the scope of this book, although it
is a straightforward greedy method that is closely related to Gram-Schmidt
orthogonalization. There are numerous variations, alternatives, and important
results on the computation of the SVD; for a more thorough discussion, see
[711]. Randomized numerical algorithms are increasingly used to compute the
SVD of very large matrices, as discussed in a later section.


MATLAB
In MATLAB, computing the SVD is straightforward:

>> X = randn(5,3);     % Create a 5x3 random data matrix
>> [U,S,V] = svd(X);   % Singular value decomposition

For non-square matrices X, the economy SVD is more efficient:

>> [Uhat,Shat,V] = svd(X,'econ');   % Economy sized SVD


Python³

>>> import numpy as np
>>> X = np.random.rand(5, 3)   # Create random data matrix
>>> U, S, VT = np.linalg.svd(X, full_matrices=True)   # Full SVD
>>> Uhat, Shat, VThat = np.linalg.svd(X, full_matrices=False)   # Economy SVD


R 
> X <- replicate(3, rnorm(5) ) 
> s <- svd(X) 
> U <- s$u 
> S <- diag(s$d) 
> V <- s$v 
Mathematica 
In:= X=RandomReal[{0,1},{5,3}] 
In:= {U,S,V} = SingularValueDecomposition [X] 


³ Note that Python returns the (conjugate) transpose of V.

Other Languages


The SVD is also available in other languages, such as Fortran and C++. In fact,
most SVD implementations are based on the LAPACK (Linear Algebra Package)
library in Fortran. The SVD routine is designated DGESVD in LAPACK, and
this is wrapped in the C++ libraries Armadillo and Eigen.


Historical Perspective 


The SVD has a long and rich history, ranging from early work developing the 
theoretical foundations to modern work on computational stability and effi- 
ciency. There is an excellent historical review by Stewart [676], which provides 
context and many important details. The review focuses on the early theoret- 
ical work of Beltrami and Jordan (1873), Sylvester (1889), Schmidt (1907), and 
Weyl (1912). It also discusses more recent work, including the seminal
computational work of Golub and collaborators [285, 286]. In addition, there are many
excellent chapters on the SVD in modern texts [24, 420, 711].


Uses in This Book and Assumptions of the Reader 


The SVD is the basis for many related techniques in dimensionality
reduction. These methods include principal component analysis (PCA) in statistics
[552], the Karhunen-Loève transform (KLT) [453], empirical
orthogonal functions (EOFs) in climate [459], the proper orthogonal
decomposition (POD) in fluid dynamics [335], and canonical correlation analysis (CCA)
[176]. Although developed independently in a range of diverse fields, many of
these methods only differ in how the data is collected and pre-processed. There
is an excellent discussion about the relationship between the SVD, the KLT, and
PCA by Gerbrands [275].

The SVD is also widely used in system identification and control theory 
to obtain reduced-order models that are balanced in the sense that states are 
hierarchically ordered in terms of their ability to be observed by measurements 
and controlled by actuation [509]. 

For this chapter, we assume that the reader is familiar with linear algebra,
with some experience in computation and numerics. For review, there are a
number of excellent books on numerical linear algebra, with discussions on the
SVD [711].


1.2 Matrix Approximation 


Perhaps the most useful and defining property of the SVD is that it provides
an optimal low-rank approximation to a matrix $\mathbf{X}$. In fact, the SVD provides a
hierarchy of low-rank approximations, since a rank-$r$ approximation is obtained
by keeping the leading $r$ singular values and vectors, and discarding the rest.
Because $\boldsymbol{\Sigma}$ is diagonal, it is possible to express the matrix $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*$ as a
sum of rank-one matrices:

$$\mathbf{X} = \sum_{k=1}^{m} \sigma_k \mathbf{u}_k \mathbf{v}_k^* = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^* + \sigma_2 \mathbf{u}_2 \mathbf{v}_2^* + \cdots + \sigma_m \mathbf{u}_m \mathbf{v}_m^*, \tag{1.4}$$

where $\sigma_k$ is the $k$th diagonal entry of $\boldsymbol{\Sigma}$, and $\mathbf{u}_k$ and $\mathbf{v}_k$ are the $k$th columns of
$\mathbf{U}$ and $\mathbf{V}$, respectively. This is known as the dyadic summation. The singular
values $\sigma_k$ are arranged in decreasing order, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_m \geq 0$, so each
subsequent rank-one matrix $\sigma_k \mathbf{u}_k \mathbf{v}_k^*$ is less important than the previous matrix
in capturing the information in $\mathbf{X}$. For many systems, the singular values $\sigma_k$
decrease rapidly, and it is possible to obtain a good approximation of $\mathbf{X}$ by
truncating at some rank $r$:

$$\mathbf{X} \approx \tilde{\mathbf{X}} = \sum_{k=1}^{r} \sigma_k \mathbf{u}_k \mathbf{v}_k^* = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^* + \sigma_2 \mathbf{u}_2 \mathbf{v}_2^* + \cdots + \sigma_r \mathbf{u}_r \mathbf{v}_r^*. \tag{1.5}$$
Here, we establish the notation that a truncated SVD basis (and the resulting
approximated matrix $\tilde{\mathbf{X}}$) will be denoted by $\tilde{\mathbf{X}} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*$, where $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$
contain the first $r$ columns of $\mathbf{U}$ and $\mathbf{V}$, and $\tilde{\boldsymbol{\Sigma}}$ contains the first $r \times r$ sub-block
of $\boldsymbol{\Sigma}$. The truncated SVD is illustrated in Fig. 1.2, with $\tilde{\mathbf{U}}$, $\tilde{\boldsymbol{\Sigma}}$, and $\tilde{\mathbf{V}}$ denoting
the truncated matrices. If $\mathbf{X}$ does not have full rank, then some of the singular
values in $\hat{\boldsymbol{\Sigma}}$ may be zero, and the truncated SVD may still be exact. However,
for truncation values $r$ that are smaller than the number of non-zero singular
values (i.e., the rank of $\mathbf{X}$), the truncated SVD only approximates $\mathbf{X}$.

For a given rank $r$, there is no better approximation for $\mathbf{X}$, in the $\ell_2$ sense,
than the truncated SVD approximation $\tilde{\mathbf{X}}$. The Eckart-Young theorem below
will state this precisely and provide expressions for the error of the truncated
approximation. There are numerous choices for the truncation rank $r$, and they
are discussed in Section 1.7. Thus, high-dimensional data may be well described
by a few dominant patterns given by the columns of $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$.
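To make the dyadic summation concrete, the sketch below builds a rank-$r$ approximation in two equivalent ways: as a sum of rank-one outer products and as the truncated matrix product. The matrix size (8 x 5), the random test data, and the choice $r = 2$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 5))
U, s, Vh = np.linalg.svd(X, full_matrices=False)

r = 2
# Dyadic sum of the first r rank-one matrices sigma_k * u_k * v_k^*.
# For real data, row k of Vh is v_k^*.
X_dyadic = sum(s[k] * np.outer(U[:, k], Vh[k, :]) for k in range(r))

# Equivalent truncated product using the first r columns/rows.
X_trunc = U[:, :r] @ np.diag(s[:r]) @ Vh[:r, :]

# Keeping all m terms recovers X up to round-off.
X_full = sum(s[k] * np.outer(U[:, k], Vh[k, :]) for k in range(len(s)))
```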

This is an important property of the SVD, and we will return to it many 
times. There are numerous examples of data sets that contain high-dimensional 
measurements, resulting in a large data matrix X. However, there are often 
dominant low-dimensional patterns in the data, and the truncated SVD basis U 
provides a coordinate transformation from the high-dimensional measurement 
space into a low-dimensional pattern space. This has the benefit of reducing the 
size and dimension of large data sets, yielding a tractable basis for visualization 
and analysis. Finally, many systems considered in this text are dynamic (see
Chapter 7), and the SVD basis provides a hierarchy of modes that characterize
the observed attractor, on which we may project a low-dimensional dynamical
system to obtain reduced-order models (see Chapter 13).

Figure 1.2: Schematic of truncated SVD. The subscript 'rem' denotes the
remainder of $\mathbf{U}$, $\boldsymbol{\Sigma}$, or $\mathbf{V}$ after truncation.


Optimal Approximation and Error Bounds 


Schmidt (of Gram-Schmidt) generalized the SVD to function spaces and
developed an approximation theorem, establishing the truncated SVD $\tilde{\mathbf{X}}$ as the
optimal low-rank approximation of the underlying matrix $\mathbf{X}$ [639]. Schmidt's
approximation theorem was rediscovered by Eckart and Young [228], and is
sometimes referred to as the Eckart-Young theorem.


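Theorem 1.1 (Eckart-Young [228]) The optimal rank-$r$ approximation to $\mathbf{X}$, in a
least-squares sense, is given by the rank-$r$ SVD truncation $\tilde{\mathbf{X}}$:

$$\underset{\tilde{\mathbf{X}},\ \text{s.t. } \operatorname{rank}(\tilde{\mathbf{X}}) = r}{\operatorname{argmin}} \ \|\mathbf{X} - \tilde{\mathbf{X}}\|_F = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*. \tag{1.6}$$

Again, $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ denote the first $r$ leading columns of $\mathbf{U}$ and $\mathbf{V}$, and $\tilde{\boldsymbol{\Sigma}}$
contains the leading $r \times r$ sub-block of $\boldsymbol{\Sigma}$. The Frobenius norm above is defined as
$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} |X_{ij}|^2}$, which is equivalent to the 2-norm of the vectorized
matrix $\mathbf{X}(:)$.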

Thus, the Eckart-Young theorem guarantees that the truncated SVD
provides the best matrix approximation of a given rank in the Frobenius norm. It
is also possible to exactly quantify the error of the rank-$r$ SVD approximation:

$$\|\mathbf{X} - \tilde{\mathbf{X}}\|_F^2 = \sum_{k=r+1}^{m} \sigma_k^2. \tag{1.7}$$

Thus, all other rank-$r$ matrices will have at least this much error. Because
the error scales with the size and magnitude of $\mathbf{X}$, it is often more useful to
consider the relative error

$$\frac{\|\mathbf{X} - \tilde{\mathbf{X}}\|_F^2}{\|\mathbf{X}\|_F^2} = \frac{\sum_{k=r+1}^{m} \sigma_k^2}{\sum_{k=1}^{m} \sigma_k^2}. \tag{1.8}$$
This expression for the relative error in the Frobenius norm has two intuitive
interpretations. If the columns of $\mathbf{X}$ are velocity fields, for example from a
discretized fluid flow simulation, then this error is related to the fraction of the
kinetic energy that is missing in the approximation $\tilde{\mathbf{X}}$. More generally, the squared
Frobenius norm error of mean-subtracted data has the interpretation of the
amount of missing variance in the approximation $\tilde{\mathbf{X}}$. This statistical
interpretation will be explored more in Section 1.5.

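Remarkably, the SVD also provides an optimal rank-$r$ approximation in the
matrix 2-norm, also known as the spectral norm:

$$\underset{\tilde{\mathbf{X}},\ \text{s.t. } \operatorname{rank}(\tilde{\mathbf{X}}) = r}{\operatorname{argmin}} \ \|\mathbf{X} - \tilde{\mathbf{X}}\|_2 = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*. \tag{1.9}$$

The 2-norm of a matrix $\mathbf{X}$ is induced by the vector 2-norm and is given by

$$\|\mathbf{X}\|_2 = \max_{\mathbf{v} \neq \mathbf{0}} \frac{\|\mathbf{X}\mathbf{v}\|_2}{\|\mathbf{v}\|_2}.$$

The error expression for the rank-$r$ SVD approximation is even simpler in the
2-norm:

$$\|\mathbf{X} - \tilde{\mathbf{X}}\|_2 = \sigma_{r+1}. \tag{1.10}$$

This error expression is rather simple to derive by expanding

$$\mathbf{X} - \tilde{\mathbf{X}} = \sum_{k=r+1}^{m} \sigma_k \mathbf{u}_k \mathbf{v}_k^*. \tag{1.11}$$

Since the vectors $\mathbf{v}_k$ are orthonormal, the maximum of $\|(\mathbf{X} - \tilde{\mathbf{X}})\mathbf{v}\|_2$ over unit
vectors $\mathbf{v}$ is $\sigma_{r+1}$, and it is achieved for $\mathbf{v} = \mathbf{v}_{r+1}$.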
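The error expressions (1.7), (1.8), and (1.10) can be verified numerically. The following is a minimal sketch on a random 10 x 6 matrix with $r = 3$ (both assumptions); it also compares against an arbitrary random rank-$r$ matrix, which by the Eckart-Young theorem cannot have a smaller Frobenius error.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 6))
U, s, Vh = np.linalg.svd(X, full_matrices=False)

r = 3
X_r = U[:, :r] @ np.diag(s[:r]) @ Vh[:r, :]

# Squared Frobenius error, to compare with the sum of discarded sigma_k^2.
err_fro_sq = np.linalg.norm(X - X_r, 'fro') ** 2
# Spectral-norm error, to compare with sigma_{r+1} (index r, zero-based).
err_2 = np.linalg.norm(X - X_r, 2)
# Relative squared error in the Frobenius norm.
rel_err = err_fro_sq / np.linalg.norm(X, 'fro') ** 2

# Some other rank-r matrix (random factors): its error is at least as large.
B = rng.standard_normal((10, r)) @ rng.standard_normal((r, 6))
err_other = np.linalg.norm(X - B, 'fro') ** 2
```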
Example: Image Compression


We demonstrate the idea of matrix approximation with a simple example:
image compression. A recurring theme throughout this book is that large data sets
often contain underlying patterns that facilitate low-rank representations.
Natural images present a simple and intuitive example of this inherent
compressibility. A grayscale image may be thought of as a real-valued matrix $\mathbf{X} \in \mathbb{R}^{n\times m}$,
where $n$ and $m$ are the number of pixels in the vertical and horizontal
directions, respectively.⁴ Depending on the basis of representation (pixel space,
Fourier frequency domain, SVD transform coordinates), images may have very
compact approximations.

Consider the image of Mordecai the snow dog in Fig. 1.3. This image has
2000 x 1500 pixels. It is possible to take the SVD of this image and plot the
diagonal singular values, as in Fig. 1.4. Figure 1.3 shows the approximate
matrix $\tilde{\mathbf{X}}$ for various truncation values $r$. By $r = 100$, the reconstructed image is
quite accurate, and the singular values account for almost 80% of the total
cumulative sum of the singular values. The squared error is less than 4% in the
Frobenius norm. The SVD truncation results in a compression of the original
image, since only the first 100 columns of $\mathbf{U}$ and $\mathbf{V}$, along with the first 100
diagonal elements of $\boldsymbol{\Sigma}$, must be stored in $\tilde{\mathbf{U}}$, $\tilde{\boldsymbol{\Sigma}}$, and $\tilde{\mathbf{V}}$.
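The storage argument can be sketched without the dog image. The example below uses a synthetic "image" (a smooth low-rank pattern plus small noise, which is an assumption, not the book's data) and a hypothetical helper `storage_fraction` that counts the numbers kept by a rank-$r$ truncation: $r$ columns of $\mathbf{U}$, $r$ columns of $\mathbf{V}$, and $r$ singular values.

```python
import numpy as np

def storage_fraction(n, m, r):
    """Fraction of the n*m pixel values needed to store a rank-r truncation
    (hypothetical helper: r columns of U, r columns of V, r singular values)."""
    return r * (n + m + 1) / (n * m)

# Synthetic 'image': smooth low-rank structure plus small noise (assumption).
rng = np.random.default_rng(9)
n, m = 200, 150
y = np.linspace(0, 1, n)[:, None]
x = np.linspace(0, 1, m)[None, :]
X = (np.sin(4 * np.pi * x) * np.cos(2 * np.pi * y) + x * y
     + 0.01 * rng.standard_normal((n, m)))

U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Normalized cumulative sum of singular values.
cumulative = np.cumsum(s) / np.sum(s)

r = 5
X_r = U[:, :r] @ np.diag(s[:r]) @ Vh[:r, :]
rel_sq_err = np.linalg.norm(X - X_r, 'fro') ** 2 / np.linalg.norm(X, 'fro') ** 2
frac = storage_fraction(n, m, r)
```

For a real photograph the singular values typically decay more slowly than in this synthetic case, so a larger $r$ is usually needed for a comparable error.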


Code 1.1: [MATLAB] Use SVD to compress image.


% First, we load the image
A = imread('../DATA/dog.jpg');
X = double(rgb2gray(A));  % Convert RGB->gray, 256 bit->double.
nx = size(X,1); ny = size(X,2);
imagesc(X), axis off, colormap gray

% Take the SVD
[U,S,V] = svd(X);
% Approximate matrix with truncated SVD for various ranks r
for r=[5 20 100]  % Truncation value
    Xapprox = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';  % Approx. image
    figure, imagesc(Xapprox), axis off
    title(['r=',num2str(r,'%d')]);
end
% Plot singular values and cumulative sum
subplot(1,2,1), semilogy(diag(S),'k')
subplot(1,2,2), plot(cumsum(diag(S))/sum(diag(S)),'k')


⁴ It is not uncommon for image size to be specified as horizontal by vertical, i.e., $\mathbf{X}^T \in \mathbb{R}^{m\times n}$,
although we stick with vertical by horizontal to be consistent with generic matrix notation.


Figure 1.3: Image compression of Mordecai the snow dog, truncating the SVD
at various ranks r. Original image resolution is 2000 x 1500. (Panel label
recovered from the figure: r = 5, 0.57% storage.)


Code 1.1: [Python] Use SVD to compress image.


# First, we load the image
import os
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

A = imread(os.path.join('..', 'DATA', 'dog.jpg'))
X = np.mean(A, -1)  # Convert RGB to grayscale
img = plt.imshow(X)

# Take the SVD
U, S, VT = np.linalg.svd(X, full_matrices=False)
S = np.diag(S)

# Approximate matrix with truncated SVD for various ranks r
for r in (5, 20, 100):  # Construct approximate image
    Xapprox = U[:, :r] @ S[:r, :r] @ VT[:r, :]
    img = plt.imshow(Xapprox)
    plt.show()

# Plot singular values and cumulative sum
plt.figure()
plt.semilogy(np.diag(S))
plt.figure()
plt.plot(np.cumsum(np.diag(S)) / np.sum(np.diag(S)))
plt.show()


Figure 1.4: (a) Singular values $\sigma_r$ and (b) cumulative sum $\sum_{k=1}^{r} \sigma_k$ of the first r
singular values.

1.3 Mathematical Properties and Manipulations 


Here we describe important mathematical properties of the SVD, including
geometric interpretations of the unitary matrices $\mathbf{U}$ and $\mathbf{V}$, as well as a discussion
of the SVD in terms of dominant correlations in the data $\mathbf{X}$. The relationship
between the SVD and correlations in the data will be explored more in Section 1.5
on principal component analysis.


Interpretation as Dominant Correlations 


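The SVD is closely related to an eigenvalue problem involving the correlation
matrices $\mathbf{X}\mathbf{X}^*$ and $\mathbf{X}^*\mathbf{X}$, shown in Fig. 1.5 for a specific image, and in Figs. 1.6
and 1.7 for generic matrices. If we plug (1.3) into the row-wise correlation
matrix $\mathbf{X}\mathbf{X}^*$ and the column-wise correlation matrix $\mathbf{X}^*\mathbf{X}$, we find

$$\mathbf{X}\mathbf{X}^* = \mathbf{U} \begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix} \mathbf{V}^*\mathbf{V} \begin{bmatrix} \hat{\boldsymbol{\Sigma}} & \mathbf{0} \end{bmatrix} \mathbf{U}^* = \mathbf{U} \begin{bmatrix} \hat{\boldsymbol{\Sigma}}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \mathbf{U}^*, \tag{1.12a}$$

$$\mathbf{X}^*\mathbf{X} = \mathbf{V} \begin{bmatrix} \hat{\boldsymbol{\Sigma}} & \mathbf{0} \end{bmatrix} \mathbf{U}^*\mathbf{U} \begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix} \mathbf{V}^* = \mathbf{V}\hat{\boldsymbol{\Sigma}}^2\mathbf{V}^*. \tag{1.12b}$$


Figure 1.5: Correlation matrices $\mathbf{X}\mathbf{X}^*$ and $\mathbf{X}^*\mathbf{X}$ for a matrix $\mathbf{X}$ obtained from an
image of a dog. Note that both correlation matrices are symmetric.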


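Recalling that $\mathbf{U}$ and $\mathbf{V}$ are unitary, $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$ are solutions to the following
eigenvalue problems:

$$\mathbf{X}\mathbf{X}^*\mathbf{U} = \mathbf{U} \begin{bmatrix} \hat{\boldsymbol{\Sigma}}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}, \tag{1.13a}$$

$$\mathbf{X}^*\mathbf{X}\mathbf{V} = \mathbf{V}\hat{\boldsymbol{\Sigma}}^2. \tag{1.13b}$$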


In other words, each non-zero singular value of X is a positive square root of 
an eigenvalue of X*X and of XX*, which have the same non-zero eigenvalues. 
It follows that, if X is self-adjoint (i.e., X = X*), then the singular values of X 
are equal to the absolute value of the eigenvalues of X. 

This provides an intuitive interpretation of the SVD, where the columns of 
U are eigenvectors of the correlation matrix XX*, and the columns of V are 
eigenvectors of X*X. We choose to arrange the singular values in descending 
order by magnitude, and thus the columns of U are hierarchically ordered by 
how much correlation they capture in the columns of X; similarly V captures 
correlation in the rows of X. 
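The relationship between singular values and the eigenvalues of the correlation matrices can be checked directly. The sketch below assumes a random real 7 x 4 matrix and a random symmetric 5 x 5 matrix (both illustrative choices) and uses NumPy's Hermitian eigenvalue routine.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((7, 4))
s = np.linalg.svd(X, compute_uv=False)

# Eigenvalues of the Hermitian matrix X*X, sorted in descending order.
evals = np.linalg.eigvalsh(X.conj().T @ X)[::-1]

# XX* is 7 x 7: its leading eigenvalues match those of X*X, the rest are zero.
evals_big = np.linalg.eigvalsh(X @ X.conj().T)[::-1]

# Self-adjoint case: the singular values equal the absolute eigenvalues.
H = rng.standard_normal((5, 5))
H = (H + H.T) / 2
s_H = np.linalg.svd(H, compute_uv=False)
abs_eig_H = np.sort(np.abs(np.linalg.eigvalsh(H)))[::-1]
```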


Method of Snapshots 


It is often impractical to construct the matrix XX* because of the large size of 
the state dimension n, let alone solve the eigenvalue problem: if x has a million 
elements, then XX* has a trillion elements. In 1987, Sirovich observed that it 
is possible to bypass this large matrix and compute the first m columns of U 
using what is now known as the method of snapshots [663]. 

Instead of computing the eigendecomposition of $\mathbf{X}\mathbf{X}^*$ to obtain the left
singular vectors $\mathbf{U}$, we only compute the eigendecomposition of $\mathbf{X}^*\mathbf{X}$, which is
much smaller and more manageable. From (1.13b), we then obtain $\mathbf{V}$ and $\hat{\boldsymbol{\Sigma}}$. If
there are zero singular values in $\hat{\boldsymbol{\Sigma}}$, then we only keep the $r$ non-zero part, $\tilde{\boldsymbol{\Sigma}}$,
and the corresponding columns $\tilde{\mathbf{V}}$ of $\mathbf{V}$. From these matrices, it is then possible
to approximate $\tilde{\mathbf{U}}$, the first $r$ columns of $\mathbf{U}$, as follows:

$$\tilde{\mathbf{U}} = \mathbf{X}\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}. \tag{1.14}$$


Figure 1.6: Correlation matrix $\mathbf{X}\mathbf{X}^*$ is formed by taking the inner product of
rows of $\mathbf{X}$.


Figure 1.7: Correlation matrix $\mathbf{X}^*\mathbf{X}$ is formed by taking the inner product of
columns of $\mathbf{X}$.
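The method of snapshots can be sketched in a few lines. The example below assumes a random tall-skinny matrix (2000 x 10) and a relative tolerance of 1e-12 for discarding zero singular values; it is an illustration of (1.13b) and (1.14), not a replacement for a library SVD.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 2000, 10                 # tall-skinny: n >> m
X = rng.standard_normal((n, m))

# Eigendecomposition of the small m x m matrix X*X, cf. (1.13b).
evals, V = np.linalg.eigh(X.conj().T @ X)
order = np.argsort(evals)[::-1]          # sort in descending order
evals, V = evals[order], V[:, order]

# Keep only the numerically non-zero singular values (tolerance is assumed).
keep = evals > 1e-12 * evals[0]
sigma = np.sqrt(evals[keep])
V_r = V[:, keep]

# Recover the left singular vectors, cf. (1.14): U = X V Sigma^{-1}.
# Dividing by sigma scales each column of X V_r by 1/sigma_k.
U_r = (X @ V_r) / sigma

# Reference singular values from a direct SVD, for comparison only.
s_ref = np.linalg.svd(X, compute_uv=False)
```

One known caveat of this approach is that forming $\mathbf{X}^*\mathbf{X}$ squares the condition number, so small singular values are resolved less accurately than with a direct SVD.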


Generalization of the Eigendecomposition 


In a sense, the singular value decomposition is a generalization of the
eigendecomposition that is valid for all matrices, including non-square matrices and
defective square matrices that do not have a complete basis of eigenvectors.
The eigendecomposition of a diagonalizable square matrix $\mathbf{X}$ is given by

$$\mathbf{X}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda} \quad \Longrightarrow \quad \mathbf{X} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}, \tag{1.15}$$

where the columns of $\mathbf{V}$ are eigenvectors and the corresponding entries of the
diagonal matrix $\boldsymbol{\Lambda}$ are the eigenvalues. For Hermitian matrices (i.e., self-adjoint
matrices such that $\mathbf{X} = \mathbf{X}^*$), the eigendecomposition takes the form

$$\mathbf{X}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda} \quad \Longrightarrow \quad \mathbf{X} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{*}, \tag{1.16}$$

and the eigenvalues are real.
The singular value decomposition may be written in a similar form for a
generic matrix $\mathbf{X}$ as

$$\mathbf{X}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma} \quad \Longrightarrow \quad \mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{*}. \tag{1.17}$$

In this way, the singular value spectrum, given by the collection of singular
values in $\boldsymbol{\Sigma}$, generalizes the notion of an eigenvalue spectrum, given by the
collection of eigenvalues in $\boldsymbol{\Lambda}$. Similarly, the left and right singular vectors have
an interpretation as a change of coordinates in the input space $\mathbb{C}^m$ and output
space $\mathbb{C}^n$, much as the eigenvectors provide a change of coordinates to
diagonalize a square matrix.
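A defective matrix gives a compact illustration of this generalization. In the sketch below, a 2 x 2 Jordan block (an illustrative choice) has no complete basis of eigenvectors, so the computed eigenvector matrix is numerically singular, while the SVD still exists and satisfies the form in (1.17).

```python
import numpy as np

# Defective matrix: a single Jordan block with a repeated eigenvalue of 1.
J = np.array([[1.0, 1.0],
              [0.0, 1.0]])

evals, evecs = np.linalg.eig(J)
# The computed eigenvectors are (numerically) parallel, so the eigenvector
# matrix is extremely ill-conditioned and cannot be used to diagonalize J.
cond_evecs = np.linalg.cond(evecs)

# The SVD exists regardless. NumPy returns V* as Vh.
U, s, Vh = np.linalg.svd(J)
lhs = J @ Vh.conj().T        # X V
rhs = U @ np.diag(s)         # U Sigma
```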


Geometric Interpretation 


The columns of the matrix U provide an orthonormal basis for the column 
space of X. Similarly, the columns of V provide an orthonormal basis for the 
row space of X. If the columns of X are spatial measurements in time, then U 
encodes spatial patterns, and V encodes temporal patterns. 

One property that makes the SVD particularly useful is the fact that both $\mathbf{U}$
and $\mathbf{V}$ are unitary matrices, so that $\mathbf{U}\mathbf{U}^* = \mathbf{U}^*\mathbf{U} = \mathbf{I}_{n\times n}$ and $\mathbf{V}\mathbf{V}^* = \mathbf{V}^*\mathbf{V} =
\mathbf{I}_{m\times m}$. This means that solving a system of equations involving $\mathbf{U}$ or $\mathbf{V}$ is as
simple as multiplication by the transpose, which scales as $O(n^2)$, as opposed to
traditional methods for the generic inverse, which scale as $O(n^3)$. As noted in
the previous section and in [79], the SVD is intimately connected to the spectral
properties of the compact self-adjoint operators $\mathbf{X}\mathbf{X}^*$ and $\mathbf{X}^*\mathbf{X}$.
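As a brief illustration of why unitarity is convenient, the sketch below solves a system $\mathbf{U}\mathbf{a} = \mathbf{b}$ by multiplying with the conjugate transpose and compares the result with a general-purpose linear solver. The 5 x 5 random complex matrix and the right-hand side are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
U, s, Vh = np.linalg.svd(X)

b = rng.standard_normal(n)

# Because U is unitary, U^{-1} = U*, so the solve is a matrix-vector product.
a = U.conj().T @ b

# General-purpose solver (LU factorization), used here only as a reference.
a_ref = np.linalg.solve(U, b)
```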

The SVD of $\mathbf{X}$ may be interpreted geometrically based on how a
hypersphere, given by $S^{m-1} = \{\mathbf{x} \mid \|\mathbf{x}\|_2 = 1\} \subset \mathbb{R}^m$, maps into an ellipsoid, $\{\mathbf{y} \mid \mathbf{y} =
\mathbf{X}\mathbf{x} \text{ for } \mathbf{x} \in S^{m-1}\} \subset \mathbb{R}^n$, through $\mathbf{X}$. This is shown graphically in Fig. 1.8 for
a sphere in $\mathbb{R}^3$ and a mapping $\mathbf{X}$ with three non-zero singular values. Because
the mapping through $\mathbf{X}$ (i.e., matrix multiplication) is linear, knowing how it
maps the unit sphere determines how all other vectors will map.

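For the specific case shown in Fig. 1.8, we construct the matrix $\mathbf{X}$ out of
three rotation matrices, $\mathbf{R}_x$, $\mathbf{R}_y$, and $\mathbf{R}_z$, and a fourth matrix to stretch out and
scale the principal axes:

$$\mathbf{X} = \underbrace{\begin{bmatrix} \cos(\theta_3) & -\sin(\theta_3) & 0 \\ \sin(\theta_3) & \cos(\theta_3) & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\mathbf{R}_z} \underbrace{\begin{bmatrix} \cos(\theta_2) & 0 & \sin(\theta_2) \\ 0 & 1 & 0 \\ -\sin(\theta_2) & 0 & \cos(\theta_2) \end{bmatrix}}_{\mathbf{R}_y} \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta_1) & -\sin(\theta_1) \\ 0 & \sin(\theta_1) & \cos(\theta_1) \end{bmatrix}}_{\mathbf{R}_x} \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{bmatrix}.$$

In this case, $\theta_1 = \pi/15$, $\theta_2 = -\pi/9$, and $\theta_3 = -\pi/20$, and $\sigma_1 = 3$, $\sigma_2 = 1$, and
$\sigma_3 = 0.5$. These rotation matrices do not commute, and so the order of
rotation matters. If one of the singular values is zero, then a dimension is removed
and the ellipsoid collapses onto a lower-dimensional subspace. The product
$\mathbf{R}_z\mathbf{R}_y\mathbf{R}_x$ is the unitary matrix $\mathbf{U}$ in the SVD of $\mathbf{X}$. The matrix $\mathbf{V}$ is the identity.
Codes to reproduce this example in MATLAB and in Python are provided on
the book's GitHub.


Figure 1.8: Geometric illustration of the SVD as a mapping from a sphere in $\mathbb{R}^m$
to an ellipsoid in $\mathbb{R}^n$.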
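The construction above can be sketched as follows. This is not the book's GitHub code; it is a minimal reconstruction from the stated angles and singular values, with a random sample of 500 points on the unit sphere added as an assumption to illustrate the mapping onto the ellipsoid.

```python
import numpy as np

theta = np.array([np.pi / 15, -np.pi / 9, -np.pi / 20])
sigma = np.array([3.0, 1.0, 0.5])

c, s_ = np.cos(theta), np.sin(theta)
Rx = np.array([[1, 0, 0],
               [0, c[0], -s_[0]],
               [0, s_[0], c[0]]])
Ry = np.array([[c[1], 0, s_[1]],
               [0, 1, 0],
               [-s_[1], 0, c[1]]])
Rz = np.array([[c[2], -s_[2], 0],
               [s_[2], c[2], 0],
               [0, 0, 1]])

# Rotate, then stretch the principal axes.
R = Rz @ Ry @ Rx
X = R @ np.diag(sigma)

U, s, Vh = np.linalg.svd(X)

# Map random points on the unit sphere through X (sample size is arbitrary).
rng = np.random.default_rng(10)
pts = rng.standard_normal((3, 500))
pts /= np.linalg.norm(pts, axis=0)
radii = np.linalg.norm(X @ pts, axis=0)
```

Since the singular values are distinct, the computed singular vectors agree with the columns of $\mathbf{R}_z\mathbf{R}_y\mathbf{R}_x$ and of the identity only up to sign, which is why a comparison should be made on absolute values.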


Invariance of the SVD to Unitary Transformations 


A useful property of the SVD is that if we left- or right-multiply our data matrix
$\mathbf{X}$ by a unitary transformation, it preserves the terms in the SVD, except for the
corresponding left or right unitary matrix $\mathbf{U}$ or $\mathbf{V}$, respectively. This has
important implications, since the discrete Fourier transform (DFT; see Chapter 2) $\mathcal{F}$
is a unitary transform, meaning that the SVD of data $\hat{\mathbf{X}} = \mathcal{F}\mathbf{X}$ will be exactly
the same as the SVD of $\mathbf{X}$, except that the modes $\hat{\mathbf{U}}$ will be the DFT of modes
$\mathbf{U}$: $\hat{\mathbf{U}} = \mathcal{F}\mathbf{U}$. In addition, the invariance of the SVD to unitary transformations
enables the use of compressed measurements to reconstruct SVD modes that
are sparse in some transform basis (see Chapter 3).
The invariance of SVD to unitary transformations is geometrically intuitive,
as unitary transformations rotate vectors in space, but do not change their inner
products or correlation structures. We denote a left unitary transformation by
$\mathbf{C}$, so that $\mathbf{Y} = \mathbf{C}\mathbf{X}$, and a right unitary transformation by $\mathbf{P}^*$, so that $\mathbf{Y} = \mathbf{X}\mathbf{P}^*$.
The SVD of $\mathbf{X}$ will be denoted $\mathbf{U}_X\boldsymbol{\Sigma}_X\mathbf{V}_X^*$ and the SVD of $\mathbf{Y}$ will be $\mathbf{U}_Y\boldsymbol{\Sigma}_Y\mathbf{V}_Y^*$.


Left Unitary Transformations 


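First, consider a left unitary transformation of $\mathbf{X}$: $\mathbf{Y} = \mathbf{C}\mathbf{X}$. Computing the
correlation matrix $\mathbf{Y}^*\mathbf{Y}$, we find

$$\mathbf{Y}^*\mathbf{Y} = \mathbf{X}^*\mathbf{C}^*\mathbf{C}\mathbf{X} = \mathbf{X}^*\mathbf{X}. \tag{1.18}$$

The projected data has the same eigendecomposition, resulting in the same $\mathbf{V}_X$
and $\boldsymbol{\Sigma}_X$. Using the method of snapshots to reconstruct $\mathbf{U}_Y$, we find

$$\mathbf{U}_Y = \mathbf{Y}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{C}\mathbf{X}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{C}\mathbf{U}_X. \tag{1.19}$$

Thus, $\mathbf{U}_Y = \mathbf{C}\mathbf{U}_X$, $\boldsymbol{\Sigma}_Y = \boldsymbol{\Sigma}_X$, and $\mathbf{V}_Y = \mathbf{V}_X$. The SVD of $\mathbf{Y}$ is then

$$\mathbf{Y} = \mathbf{C}\mathbf{X} = \mathbf{C}\mathbf{U}_X\boldsymbol{\Sigma}_X\mathbf{V}_X^*. \tag{1.20}$$


Right Unitary Transformations


For a right unitary transformation $\mathbf{Y} = \mathbf{X}\mathbf{P}^*$, the correlation matrix $\mathbf{Y}^*\mathbf{Y}$ is

$$\mathbf{Y}^*\mathbf{Y} = \mathbf{P}\mathbf{X}^*\mathbf{X}\mathbf{P}^* = \mathbf{P}\mathbf{V}_X\boldsymbol{\Sigma}_X^2\mathbf{V}_X^*\mathbf{P}^*, \tag{1.21}$$

with the following eigendecomposition:

$$\mathbf{Y}^*\mathbf{Y}\mathbf{P}\mathbf{V}_X = \mathbf{P}\mathbf{V}_X\boldsymbol{\Sigma}_X^2. \tag{1.22}$$

Thus, $\mathbf{V}_Y = \mathbf{P}\mathbf{V}_X$ and $\boldsymbol{\Sigma}_Y = \boldsymbol{\Sigma}_X$. We may use the method of snapshots to
reconstruct $\mathbf{U}_Y$:

$$\mathbf{U}_Y = \mathbf{Y}\mathbf{P}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{X}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{U}_X. \tag{1.23}$$

Thus, $\mathbf{U}_Y = \mathbf{U}_X$, and we may write the SVD of $\mathbf{Y}$ as

$$\mathbf{Y} = \mathbf{X}\mathbf{P}^* = \mathbf{U}_X\boldsymbol{\Sigma}_X\mathbf{V}_X^*\mathbf{P}^*. \tag{1.24}$$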
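These invariance results are easy to check numerically. The sketch below assumes a random real 6 x 4 matrix, uses the unitary DFT matrix as the left transformation, and draws a random orthogonal matrix from a QR factorization as the right transformation; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 6, 4
X = rng.standard_normal((n, m))
Ux, sx, Vhx = np.linalg.svd(X, full_matrices=False)

# Left transformation: the unitary DFT matrix (scaled by 1/sqrt(n)).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
# Right transformation: a random orthogonal matrix P from a QR factorization.
P, _ = np.linalg.qr(rng.standard_normal((m, m)))

Y_left = F @ X
Y_right = X @ P.conj().T

# The singular values are unchanged by either transformation.
s_left = np.linalg.svd(Y_left, compute_uv=False)
s_right = np.linalg.svd(Y_right, compute_uv=False)

# Factorizations predicted by (1.20) and (1.24).
Y_left_pred = (F @ Ux) @ np.diag(sx) @ Vhx
Y_right_pred = Ux @ np.diag(sx) @ (Vhx @ P.conj().T)
```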


1.4 Pseudo-Inverse, Least-Squares, and Regression 


Many physical systems may be represented as a linear system of equations, 


$$\mathbf{A}\mathbf{x} = \mathbf{b}, \tag{1.25}$$

where the constraint matrix $\mathbf{A}$ and vector $\mathbf{b}$ are known, and the vector $\mathbf{x}$ is
unknown. If $\mathbf{A}$ is a square, invertible matrix (i.e., $\mathbf{A}$ has non-zero determinant),
then there exists a unique solution $\mathbf{x}$ for every $\mathbf{b}$. However, when $\mathbf{A}$ is either
singular or rectangular, there may be one, none, or infinitely many solutions,
depending on the specific $\mathbf{b}$ and the column and row spaces of $\mathbf{A}$.

First, consider the under-determined system, where $\mathbf{A} \in \mathbb{C}^{n\times m}$ and $n \ll m$
(i.e., $\mathbf{A}$ is a short-fat matrix), so that there are fewer equations than unknowns.
This type of system is likely to have columns that span all of $\mathbb{R}^n$, since it has
many more columns than are required for a linearly independent basis.⁵
Generically, if a short-fat $\mathbf{A}$ has $n$ linearly independent columns (i.e., its column space
spans $\mathbb{R}^n$), then there are infinitely many solutions $\mathbf{x}$ for every $\mathbf{b}$. The system
is called under-determined because there are not enough values in $\mathbf{b}$ to uniquely
determine the higher-dimensional $\mathbf{x}$.

Similarly, consider the over-determined system, where $n \gg m$ (i.e., a
tall-skinny matrix), so that there are more equations than unknowns. This matrix
cannot have n linearly independent columns, and so it is guaranteed that there
are vectors b that have no solution x. In fact, there will only be a solution x if b
is in the column space of A, i.e., $b \in \mathrm{col}(A)$.

Technically, there may be some choices of b that admit infinitely many
solutions x for a tall-skinny matrix A and other choices of b that admit zero
solutions even for a short-fat matrix. The solution space to the system in (1.25)
is determined by the following four fundamental subspaces of $A = \tilde{U} \tilde{\Sigma} \tilde{V}^*$,
where the rank r is chosen to include all non-zero singular values:


• The column space, col(A), is the span of the columns of A, also known as
the range. The column space of A is the same as the column space of $\tilde{U}$.


• The orthogonal complement to col(A) is ker(A*), given by the column
space of $\hat{U}^\perp$ from Fig. 1.1.


• The row space, row(A), is the span of the rows of A, which is spanned by
the columns of $\tilde{V}$. The row space of A is equal to row(A) = col(A*).


• The kernel space, ker(A), is the orthogonal complement to row(A), and
is also known as the null space. The null space is the subspace of vectors
that map through A to zero, i.e., Ax = 0, given by col($\hat{V}^\perp$).


More precisely, if $b \in \mathrm{col}(A)$ and if $\dim(\ker(A)) \neq 0$, then there are infinitely
many solutions x. Note that the condition $\dim(\ker(A)) \neq 0$ is guaranteed for
a short-fat matrix. Similarly, if $b \notin \mathrm{col}(A)$, then there are no solutions, and the
system of equations in (1.25) is called inconsistent.

The fundamental subspaces above satisfy the following properties:


$\mathrm{col}(A) \oplus \ker(A^*) = \mathbb{R}^n,$ (1.26a)
$\mathrm{col}(A^*) \oplus \ker(A) = \mathbb{R}^m.$ (1.26b)


⁹ It is easy to construct degenerate examples where the columns of a short-fat matrix do not
form a full basis for $\mathbb{R}^n$, such as $A = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}$.
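
The four fundamental subspaces can be read off from a full SVD. The following is a minimal sketch, assuming NumPy; the 4 x 3 rank-2 matrix and the rank tolerance are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical rank-deficient example: third column = first column + second column
A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.],
              [0., 1., 1.]])
n, m = A.shape

U, S, Vh = np.linalg.svd(A, full_matrices=True)
tol = max(n, m) * np.finfo(float).eps * S[0]   # assumed numerical rank tolerance
r = int(np.sum(S > tol))                       # number of non-zero singular values

col_A = U[:, :r]             # orthonormal basis for col(A)
ker_Ah = U[:, r:]            # orthonormal basis for ker(A*), complement of col(A)
row_A = Vh[:r, :].conj().T   # orthonormal basis for row(A) = col(A*)
ker_A = Vh[r:, :].conj().T   # orthonormal basis for ker(A), complement of row(A)
```

The dimensions add up as in (1.26): col(A) and ker(A*) together have dimension n, while row(A) and ker(A) together have dimension m.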


Remark 1.1 There is an extensive literature on random matrix theory, where the above
stereotypes are almost certainly true, meaning that they are true with high probability.
For example, a system Ax = b is extremely unlikely to have a solution for a random
matrix $A \in \mathbb{R}^{n \times m}$ and random vector $b \in \mathbb{R}^n$ with $n \gg m$, since there is little chance
that b is in the column space of A. These properties of random matrices will play a
prominent role in compressed sensing (see Chapter 3).


In the over-determined case when no solution exists, we would often like
to find the solution x that minimizes the sum-squared error $\|Ax - b\|_2^2$, the
so-called least-squares solution. Note that the least-squares solution also minimizes
$\|Ax - b\|_2$. In the under-determined case when infinitely many solutions exist,
we may like to find the solution x with minimum norm $\|x\|_2$ so that Ax = b,
the so-called minimum-norm solution.
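
As a quick illustration of the two cases, the sketch below uses NumPy's np.linalg.lstsq, which returns the least-squares solution and, when several minimizers exist, the one with smallest 2-norm. The random matrices and the comparison vector are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Over-determined: 10 equations, 2 unknowns -> least-squares solution
A_tall = rng.standard_normal((10, 2))
b_tall = rng.standard_normal(10)
x_ls, *_ = np.linalg.lstsq(A_tall, b_tall, rcond=None)
# At the minimizer, the residual is orthogonal to the columns of A
grad = A_tall.T @ (A_tall @ x_ls - b_tall)

# Under-determined: 2 equations, 10 unknowns -> minimum-norm solution
A_fat = rng.standard_normal((2, 10))
b_fat = rng.standard_normal(2)
x_mn, *_ = np.linalg.lstsq(A_fat, b_fat, rcond=None)

# Any other exact solution x_mn + z, with z in ker(A), has a larger norm
z = rng.standard_normal(10)
z -= np.linalg.pinv(A_fat) @ (A_fat @ z)   # remove the row-space part of z
x_other = x_mn + z
```

Both x_mn and x_other satisfy the under-determined system exactly, but x_mn has the smaller norm.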

The SVD is the technique of choice for these important optimization
problems. First, if we substitute an exact truncated SVD $A = \tilde{U} \tilde{\Sigma} \tilde{V}^*$ in for A, we
can “invert” each of the matrices $\tilde{U}$, $\tilde{\Sigma}$, and $\tilde{V}^*$ in turn, resulting in the
Moore-Penrose left pseudo-inverse [560, 561, 604, 776] $A^\dagger$ of A:


$A^\dagger \triangleq \tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^* \implies A^\dagger A = \tilde{V} \tilde{V}^*.$ (1.27)


Note that $A^\dagger A$ will only equal the identity $I_{m \times m}$ if A has full column rank and
the truncated SVD captures all non-zero singular values; otherwise
$\tilde{V} \tilde{V}^* \neq I_{m \times m}$, and it will only approximate the identity. This may be used to
find both the minimum-norm and least-squares solutions to (1.25):


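$A^\dagger A \tilde{x} = A^\dagger b \implies \tilde{x} = \tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^* b.$ (1.28)


Plugging the solution $\tilde{x}$ back in to (1.25) results in


$A \tilde{x} = \tilde{U} \tilde{\Sigma} \tilde{V}^* \tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^* b$ (1.29a)
$\phantom{A \tilde{x}} = \tilde{U} \tilde{U}^* b.$ (1.29b)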


Although $U^* U = U U^* = I_{n \times n}$ for the exact SVD, $\tilde{U} \tilde{U}^*$ is not necessarily the
identity matrix for a truncated basis of left singular vectors $\tilde{U}$, but is rather a
projection onto the column space of $\tilde{U}$. Therefore, $\tilde{x}$ will only be an exact
solution to (1.25) when b is in the column space of $\tilde{U}$, and therefore in the column
space of A. Assuming that $\tilde{U} \tilde{U}^*$ is equal to the identity is one of the most
common accidental misuses of the SVD.¹⁰ However, it is still true that
$\tilde{U}^* \tilde{U} = I_{r \times r}$, where r is the rank of A.
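
The distinction between $\tilde{U}^* \tilde{U}$ and $\tilde{U} \tilde{U}^*$ is easy to see numerically. The following is a minimal sketch, assuming NumPy and a random tall-skinny matrix (which has full column rank generically); it builds the pseudo-inverse from the economy SVD as in (1.27) and checks (1.29).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
A = rng.standard_normal((n, m))   # tall-skinny, full column rank (generically)
b = rng.standard_normal(n)        # generic b, not in col(A)

U, S, Vh = np.linalg.svd(A, full_matrices=False)        # economy SVD, r = m
A_pinv = Vh.conj().T @ np.diag(1.0 / S) @ U.conj().T    # V * inv(Sigma) * U^*
x_tilde = A_pinv @ b                                    # least-squares solution

P = U @ U.conj().T                # U U^*: n x n projector onto col(A)
UhU = U.conj().T @ U              # U^* U: r x r identity
```

Here UhU is the 3 x 3 identity, while P is a rank-3 projector in 8 dimensions, so A @ x_tilde equals P @ b rather than b itself.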

Computing the pseudo-inverse $A^\dagger$ is computationally efficient, after the
expensive up-front cost of computing the SVD. Inverting the unitary matrices $\tilde{U}$
and $\tilde{V}$ involves matrix multiplications by the transpose matrices, which are
$O(n^2)$ operations. Inverting $\tilde{\Sigma}$ is even more efficient, since it is a diagonal
matrix, requiring $O(n)$ operations. In contrast, inverting a dense square matrix
would require an $O(n^3)$ operation.


Condition Number 


The condition number of a matrix A is a measure of how sensitive matrix
multiplication and inversion are to errors in the input. A larger condition number
indicates higher sensitivity and worse performance. The condition number $\kappa(A)$
is directly related to the singular values of the matrix:


$\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}.$ (1.30)


The condition number is a central concept in all of numerical linear algebra
and applied computation. It is easiest to understand the effect of a large
condition number when considering the linear system of equations Ax = b. If
the vector x is not specified perfectly, but instead has some error $\epsilon_x$, then the
system becomes


$A(x + \epsilon_x) = b + \epsilon_b,$ (1.31)


where $\epsilon_b$ is the corresponding error in b. If we assume a worst-case scenario,
where $\epsilon_x$ is aligned with the singular vector corresponding to the maximum
singular value $\sigma_{\max}$, and the vector x is aligned with the singular vector
corresponding to the minimum singular value $\sigma_{\min}$, then the output is


$A(x + \epsilon_x) = \sigma_{\min} x + \underbrace{\sigma_{\max} \epsilon_x}_{\epsilon_b}.$ (1.32)


The output signal-to-noise $\|b\| / \|\epsilon_b\|$ is equal to the input signal-to-noise $\|x\| / \|\epsilon_x\|$
multiplied by a factor of $\sigma_{\min} / \sigma_{\max}$. In other words, the signal-to-noise of the
output has been reduced by a factor equal to the condition number $\kappa(A)$. Even
if the vectors x and $\epsilon_x$ are not perfectly aligned with the worst-case directions,
they are likely to have some component of all of the singular vector directions.
In this case, components of the error will still experience large amplification
relative to other components of the desired output.


¹⁰ The authors are not immune to this, having mistakenly used this fictional identity in an
early version of [134].

A similar issue arises when solving for x given an imperfectly specified b
with some error $\epsilon_b$. Now the worst-case scenario is where $\epsilon_b$ is aligned with
the singular vector corresponding to the minimum singular value $\sigma_{\min}$, and the
vector b is aligned with the singular vector corresponding to the maximum
singular value $\sigma_{\max}$. Then the estimated solution for $x + \epsilon_x$ is


$x + \epsilon_x = A^\dagger (b + \epsilon_b) = \frac{1}{\sigma_{\max}} b + \frac{1}{\sigma_{\min}} \epsilon_b.$ (1.33)


The signal-to-noise of the estimated x has also been reduced, or degraded, by
a factor equal to the condition number $\kappa(A)$.
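
The worst-case degradation in (1.32) and (1.33) can be reproduced with a small example. This sketch assumes NumPy and a hypothetical diagonal matrix, chosen so that the singular vectors are simply the coordinate axes.

```python
import numpy as np

# Diagonal example: singular values 10 and 0.1, singular vectors are the axes
sigma_max, sigma_min = 10.0, 0.1
A = np.diag([sigma_max, sigma_min])
kappa = np.linalg.cond(A)            # 2-norm condition number, sigma_max/sigma_min

# Forward problem, worst case: x along sigma_min direction, error along sigma_max
x = np.array([0.0, 1.0])
eps_x = np.array([1e-3, 0.0])
b, eps_b = A @ x, A @ eps_x
snr_in = np.linalg.norm(x) / np.linalg.norm(eps_x)
snr_out = np.linalg.norm(b) / np.linalg.norm(eps_b)

# Inverse problem, worst case: b along sigma_max direction, error along sigma_min
b2 = np.array([1.0, 0.0])
eps_b2 = np.array([0.0, 1e-3])
x2, eps_x2 = np.linalg.solve(A, b2), np.linalg.solve(A, eps_b2)
snr_b = np.linalg.norm(b2) / np.linalg.norm(eps_b2)
snr_x = np.linalg.norm(x2) / np.linalg.norm(eps_x2)
```

In both directions the signal-to-noise ratio drops by the factor kappa = 100.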

One approach to mitigate a large condition number is to truncate the SVD
more aggressively, essentially increasing the effective minimum singular value
$\sigma_{\min}$. However, this comes at the cost of decreasing the size of the subspace $\tilde{U}$
used to approximate the output.


>> kappanew = 1e5;  % Desired condition number
>> [U,S,V] = svd(A,'econ');
>> r = max(find(diag(S)>max(S(:))/kappanew));  % Keep sigma_k > sigma_max/kappanew
>> invA = V(:,1:r)*inv(S(1:r,1:r))*U(:,1:r)';  % Approximate inverse
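
The same truncation can be sketched in Python. The helper name truncated_pinv and the test matrix with prescribed singular values are hypothetical, introduced only for illustration; the rule keeps the singular values larger than $\sigma_{\max}$ divided by the desired condition number.

```python
import numpy as np

def truncated_pinv(A, kappa_new):
    """Pseudo-inverse keeping sigma_k > sigma_max / kappa_new (hypothetical helper)."""
    U, S, Vh = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(S > S[0] / kappa_new))   # S is sorted in decreasing order
    A_inv = Vh[:r, :].conj().T @ np.diag(1.0 / S[:r]) @ U[:, :r].conj().T
    return A_inv, r, S

# Example matrix with prescribed singular values 1, 1e-2, 1e-8
rng = np.random.default_rng(2)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q1 @ np.diag([1.0, 1e-2, 1e-8]) @ Q2.T

A_inv, r, S = truncated_pinv(A, kappa_new=1e5)
kappa_full = S[0] / S[-1]        # condition number before truncation
kappa_trunc = S[0] / S[r - 1]    # effective condition number after truncation
```

Here the smallest singular value is discarded, so the effective condition number falls from about 1e8 to about 1e2, at the price of a rank-2 rather than rank-3 approximation.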


One-Dimensional Linear Regression 


Regression is an important statistical tool to relate variables to one another
based on data [477]. Consider the collection of data in Fig. 1.9. The red crosses
are obtained by adding Gaussian white noise to the black line, as shown in
Code 1.2. We assume that the data is linearly related, as in (1.25), and we use
the pseudo-inverse to find the least-squares solution for the slope x below (blue
dashed line), shown in Code 1.2:


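$b = a x = \tilde{U} \tilde{\Sigma} \tilde{V}^* x$ (1.34a)


$\implies \tilde{x} = \tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^* b.$ (1.34b)


In (1.34b), $\tilde{\Sigma} = \|a\|_2$, $\tilde{V} = 1$, and $\tilde{U} = a / \|a\|_2$. Taking the left pseudo-inverse:


$\tilde{x} = \frac{a^* b}{\|a\|_2^2}.$ (1.35)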


This makes physical sense, if we think of x as the value that best maps our
vector a to the vector b. Then, the best single value x is obtained by taking
the dot product of b with the normalized a direction. We then add a second
normalization factor $\|a\|_2$ because the a in (1.34a) is not normalized.

Note that strange things happen if you use row vectors instead of column
vectors in (1.34) above: the product $\tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^* b$ then yields a matrix rather than
a scalar slope. Also, if the noise magnitude becomes large relative to
the slope x, the pseudo-inverse will undergo a phase change in accuracy,
related to the hard-thresholding results in subsequent sections.


Figure 1.9: Illustration of linear regression using noisy data. (Legend: true line,
noisy data, regression line.)


Code 1.2: [MATLAB] Least-squares fit of noisy data in Fig. 1.9.


% Generate noisy data
x = 3;                          % True slope
a = [-2:.25:2]';
b = a*x + 1*randn(size(a));     % Add noise
plot(a,x*a,'k')                 % True relationship
hold on, plot(a,b,'rx')         % Noisy measurements

% Compute least-squares approximation with the SVD
[U,S,V] = svd(a,'econ');
xtilde = V*inv(S)*U'*b;         % Least-square fit
plot(a,xtilde*a,'b--')          % Plot fit

% Alternative formulations of least-squares
xtilde1 = V*inv(S)*U'*b
xtilde2 = pinv(a)*b
xtilde3 = regress(b,a)
Code 1.2: [Python] Least-squares fit of noisy data in Fig. 1.9.


import numpy as np
import matplotlib.pyplot as plt

x = 3 # True slope
a = np.arange(-2,2,0.25)
a = a.reshape(-1, 1)
b = x*a + np.random.randn(*a.shape) # Add noise
plt.plot(a, x*a, color='k', linewidth=2, label='True line') # True relationship
plt.plot(a, b, 'x', color='r', markersize=10, label='Noisy data') # Noisy measurements

# Compute least-squares approximation with the SVD
U, S, VT = np.linalg.svd(a, full_matrices=False)
xtilde = VT.T @ np.linalg.inv(np.diag(S)) @ U.T @ b # Least-square fit
plt.plot(a, xtilde * a, '--', color='b', linewidth=4, label='Regression line')

# Alternative formulations of least squares
xtilde1 = VT.T @ np.linalg.inv(np.diag(S)) @ U.T @ b
xtilde2 = np.linalg.pinv(a) @ b


Multi-linear Regression


Example 1: Cement Heat Generation Data


First, we begin with a simple built-in MATLAB data set that describes the heat
generation for various cement mixtures that comprise four basic ingredients
(see Fig. 1.10). In this problem, we are solving (1.25), where $A \in \mathbb{R}^{13 \times 4}$, since
there are four ingredients and heat measurements for 13 unique mixtures. The
goal is to determine the weighting x that relates the proportions of the four
ingredients to the heat generation. It is possible to find the minimum error
solution using the SVD, as shown in Code 1.3. Alternatives, using regress and
pinv, are also explored.


Code 1.3: [MATLAB] Multi-linear regression for cement heat data.


load hald;                  % Load Portland Cement dataset
A = ingredients;
b = heat;

[U,S,V] = svd(A,'econ');
x = V*inv(S)*U'*b;          % Solve Ax=b using the SVD

plot(b,'k'); hold on        % Plot data
plot(A*x,'r-o');            % Plot regression

x = regress(b,A);           % Alternative 1 (regress)
x = pinv(A)*b;              % Alternative 2 (pinv)


     Ingredient                      Regression
  1  Tricalcium aluminate            2.1930
  2  Tricalcium silicate             1.1533
  3  Tetracalcium aluminoferrite     0.7585
  4  Beta-dicalcium silicate         0.4863


Figure 1.10: Heat data for cement mixtures containing four basic ingredients.
(Horizontal axis: mixture.)


Code 1.3: [Python] Multi-linear regression for cement heat data.


import os
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
A = np.loadtxt(os.path.join('..','DATA','hald_ingredients.csv'),delimiter=',')
b = np.loadtxt(os.path.join('..','DATA','hald_heat.csv'),delimiter=',')

# Solve Ax=b using SVD
U, S, VT = np.linalg.svd(A,full_matrices=0)
x = VT.T @ np.linalg.inv(np.diag(S)) @ U.T @ b

plt.plot(b, color='k', linewidth=2, label='Heat Data')
plt.plot(A@x, '-o', color='r', label='Regression')

x = np.linalg.pinv(A) @ b # Alternative


Example 2: Boston Housing Data 


In this example, we explore a larger data set to determine which factors best 
predict prices in the Boston housing market [313]. This data is available from 
the UCI Machine Learning Repository [33]. 

There are 13 attributes that are correlated with house price, such as
per-capita crime rate and property-tax rate. These features are regressed onto the
price data, the best-fit price prediction is plotted against the true house value
in Fig. 1.11, and the regression coefficients are shown in Fig. 1.12. Although the
house value is not perfectly predicted, the trend agrees quite well. It is often the
case that the highest value outliers are not well captured by simple linear fits,
as in this example.


Figure 1.11: Multi-linear regression of home prices using various factors: (a)
unsorted data and (b) data sorted by home value.


Figure 1.12: Significance of various attributes in the regression.


This data contains prices and attributes for 506 homes, so the attribute ma- 
trix is of size 506 x 13. It is important to pad this matrix with an additional 
column of ones, to take into account the possibility of a non-zero constant off- 
set in the regression formula. This corresponds to the “y intercept” in a simple 
one-dimensional linear regression. The code for this example is nearly identical 
to the example above, and is available on the book’s GitHub. 
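
The effect of padding with a column of ones can be seen on synthetic data. The following is a minimal sketch, assuming NumPy; the attribute matrix, coefficients, offset, and noise level are all made up for illustration and are not the housing data.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features = 200, 3
A = rng.standard_normal((n_samples, n_features))   # synthetic attributes
x_true = np.array([1.5, -2.0, 0.5])                # synthetic coefficients
offset = 4.0                                       # non-zero constant offset
b = A @ x_true + offset + 0.01 * rng.standard_normal(n_samples)

# Pad with a column of ones so the last coefficient absorbs the offset
A_pad = np.hstack([A, np.ones((n_samples, 1))])
x_pad = np.linalg.pinv(A_pad) @ b

# Without padding, the fit is forced through the origin
x_nopad = np.linalg.pinv(A) @ b
res_pad = np.linalg.norm(A_pad @ x_pad - b)
res_nopad = np.linalg.norm(A @ x_nopad - b)
```

With padding, the last entry of x_pad recovers the offset and the residual drops to the noise level; without it, the unmodeled offset dominates the residual.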


1.5 Principal Component Analysis (PCA) 

Principal component analysis (PCA) is one of the central applications of the
SVD, providing a statistical interpretation of the data-driven, hierarchical
coordinate system used to represent high-dimensional correlated data. This
coordinate system involves the correlation matrices described in Section 1.3.
Importantly, PCA pre-processes the data by mean subtraction and setting the variance
to unity before performing the SVD. The geometry of the resulting coordinate
system is determined by principal components (PCs) that are uncorrelated
(orthogonal) to each other, but have maximal correlation with the measurements.
This theory was developed in 1901 by Pearson [552], and independently by
Hotelling in the 1930s [339, 340]. Jolliffe provides a good reference text.

Often in statistics, a number of measurements are collected in a single
experiment, and these measurements are typically arranged into a row vector.
The measurements may be features of an observable, such as demographic
features of a specific human individual. A number of experiments are conducted,
and each measurement vector is arranged as a row in a large matrix X,
resembling the structure of how data is recorded in a spreadsheet. In the example of
demography, the collection of experiments may be gathered via polling. Note
that this convention for X, consisting of rows of features, is different from the
convention throughout the remainder of this chapter, where individual feature
“snapshots” are arranged as columns. However, we choose to be consistent
with PCA literature in this section. The matrix will still be size n × m, although
it may have more rows than columns, or vice versa.


Computation 


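We now compute the average row $\bar{x}$ (i.e., the mean of all rows), and subtract it
from X. The mean $\bar{x}$ is given by


$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij},$ (1.36)


and the mean matrix is


$\bar{X} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \bar{x}.$ (1.37)


Subtracting $\bar{X}$ from X results in the mean-subtracted data B:


$B = X - \bar{X}.$ (1.38)


The covariance matrix of B is given by


$C = \frac{1}{n-1} B^* B.$ (1.39)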


Note that the covariance is normalized by n − 1 instead of n, even though there
are n sample points. This is known as Bessel’s correction, which compensates
for the fact that the sample variance is biased because it does not capture the
variance of the sample mean $\bar{x}$ about the true mean. The covariance matrix
C is symmetric and positive semi-definite, having non-negative real
eigenvalues. Each entry $C_{ij}$ quantifies the correlation of the i and j features across all
experiments.


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved.

The principal components are the eigenvectors of C, and they define a change 
of coordinates in which the covariance matrix is diagonal: 


$CV = VD \implies C = VDV^* \implies D = V^* C V.$ (1.40)


The columns of the eigenvector matrix V are the principal components, and 
the elements of the diagonal matrix D are the variances of the data along these 
directions. This transformation is guaranteed to exist, since C is Hermitian and 
the columns of V are orthonormal. In these principal component coordinates, 
all features are linearly uncorrelated with each other. 

The matrix of principal components V is also the matrix of right singular
vectors of B. Substituting $B = U \Sigma V^*$ into (1.39) and comparing with (1.40)
yields


$C = \frac{1}{n-1} B^* B = \frac{1}{n-1} V \Sigma^2 V^* \implies D = \frac{1}{n-1} \Sigma^2.$ (1.41)


The variance of the data in these coordinates, given by the diagonal elements
$\lambda_k$ of D, is related to the singular values as


$\lambda_k = \frac{\sigma_k^2}{n-1}.$ (1.42)


Thus, the SVD provides a numerically robust approach for computing the
principal components. An approximation $\tilde{B}$ obtained by keeping only the first r
principal components will have a missing variance related to the squared
Frobenius norm error in (1.7).
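
The equivalence between the eigendecomposition of C and the SVD of B can be verified directly. The following is a minimal sketch, assuming NumPy and synthetic data with rows as experiments; the scales and offsets are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 500, 3                          # n experiments (rows), m features (columns)
X = rng.standard_normal((n, m)) @ np.diag([3.0, 1.0, 0.2]) + np.array([5.0, -1.0, 2.0])

x_bar = X.mean(axis=0)                 # mean row, as in (1.36)
B = X - np.ones((n, 1)) * x_bar        # mean-subtracted data, (1.37)-(1.38)
C = B.conj().T @ B / (n - 1)           # covariance matrix, (1.39)

# Route 1: eigendecomposition of C (eigh returns eigenvalues in ascending order)
lam, V_eig = np.linalg.eigh(C)
lam, V_eig = lam[::-1], V_eig[:, ::-1]

# Route 2: SVD of B
U, S, Vh = np.linalg.svd(B, full_matrices=False)
lam_svd = S**2 / (n - 1)               # variances from singular values, (1.42)
```

The two routes give the same variances, and the eigenvectors of C agree with the right singular vectors of B up to a sign on each column.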


The pca Command 


In MATLAB, there are the additional commands pca and princomp (based on pca)
for the principal component analysis:


>> [V,score,s2] = pca(X);


The matrix V is equivalent to the V matrix from the SVD of $B = X - \bar{X}$, up to
sign changes of the columns. The vector s2 contains eigenvalues of the
covariance of B, also known as principal component variances; these values are the
squares of the singular values of B, scaled by 1/(n − 1) as in (1.42). The variable
score simply contains the coordinates of each row of B (the mean-subtracted
data) in the principal component directions. In general, we often prefer to use
the svd command with the various pre-processing steps described above.
Table 1.1: Standard deviation of data and normalized singular values.


         σ1       σ2
Data     2        0.5
SVD      1.974    0.503


Example: Noisy Gaussian Data 


Consider the noisy cloud of data in Fig. 1.13(a), generated using Code 1.4. The
data is generated by selecting 10000 vectors from a two-dimensional normal
distribution with zero mean and unit variance. These vectors are then scaled in
the x and y directions by the values in Table 1.1 and rotated by π/3. Finally, the
entire cloud of data is translated so that it has a non-zero center $x_C = [2 \;\; 1]^T$.

Using Code 1.4, the PCA is performed and used to plot confidence intervals
using multiple standard deviations, shown in Fig. 1.13(b). The singular values,
shown in Table 1.1, match the data scaling. The matrix U from the SVD also
closely matches the rotation matrix, up to a sign on the rows:


$R_{\pi/3} = \begin{bmatrix} 0.5 & 0.8660 \\ -0.8660 & 0.5 \end{bmatrix}, \qquad U = \begin{bmatrix} -0.4998 & -0.8662 \\ -0.8662 & 0.4998 \end{bmatrix}.$


Note that $R_{\pi/3}$ is a rotation matrix designed to rotate row vectors by
multiplication on the right by $R_{\pi/3}$. For rotation of column vectors by left multiplication,
this matrix would be transposed.


Code 1.4: [MATLAB] PCA example on noisy cloud of data.


% Generate noisy cloud of data
xC = [2, 1];                        % Center of data (mean)
sig = [2, .5];                      % Principal axes

theta = pi/3;                       % Rotate cloud by pi/3
R = [cos(theta) sin(theta);         % Rotation matrix
    -sin(theta) cos(theta)];

nPoints = 10000;                    % Create 10,000 points
X = randn(nPoints,2)*diag(sig)*R + ones(nPoints,2)*diag(xC);
scatter(X(:,1),X(:,2),'k.','LineWidth',2)   % Plot data

% Compute PCA and plot confidence intervals
Xavg = mean(X,1);                   % Compute mean
B = X - ones(nPoints,1)*Xavg;       % Mean-subtracted data
[U,S,V] = svd(B/sqrt(nPoints),'econ');      % PCA via SVD

theta = (0:.01:1)*2*pi;
Xstd = [cos(theta') sin(theta')]*S*V';      % 1-std conf. interval
hold on, plot(Xavg(1)+Xstd(:,1),Xavg(2)+Xstd(:,2),'r-')
plot(Xavg(1)+2*Xstd(:,1),Xavg(2)+2*Xstd(:,2),'r-')
plot(Xavg(1)+3*Xstd(:,1),Xavg(2)+3*Xstd(:,2),'r-')


Figure 1.13: (a) Principal components capture the variance of mean-subtracted
Gaussian data. (b) The first three standard deviation ellipsoids (red), and the
two left singular vectors, scaled by singular values ($\sigma_1 v_1 + x_C$ and $\sigma_2 v_2 + x_C$,
cyan), are shown.


Code 1.4: [Python] PCA example on noisy cloud of data.


import numpy as np
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2)   # Axes for the data and the intervals

# Generate noisy cloud of data
xC = np.array([2, 1])      # Center of data (mean)
sig = np.array([2, 0.5])   # Principal axes
theta = np.pi/3            # Rotate cloud by pi/3
R = np.array([[np.cos(theta), -np.sin(theta)],   # Rotation mat
              [np.sin(theta), np.cos(theta)]])
nPoints = 10000            # Create 10,000 points

X = R @ np.diag(sig) @ np.random.randn(2,nPoints) + np.diag(xC) @ np.ones((2,nPoints))
ax1.plot(X[0,:],X[1,:], '.', color='k')   # Plot data

Xavg = np.mean(X,axis=1)   # Compute mean
B = X - np.tile(Xavg,(nPoints,1)).T   # Mean-subtracted data

# Find principal components (SVD)
U, S, VT = np.linalg.svd(B/np.sqrt(nPoints),full_matrices=0)
theta = 2 * np.pi * np.arange(0,1,0.01)
Xstd = U @ np.diag(S) @ np.array([np.cos(theta),np.sin(theta)])   # 1-std conf. interval

ax2.plot(Xavg[0] + Xstd[0,:], Xavg[1] + Xstd[1,:],'-',color='r',linewidth=3)
ax2.plot(Xavg[0] + 2*Xstd[0,:], Xavg[1] + 2*Xstd[1,:],'-',color='r',linewidth=3)
ax2.plot(Xavg[0] + 3*Xstd[0,:], Xavg[1] + 3*Xstd[1,:],'-',color='r',linewidth=3)


Finally, it is also possible to compute using the pca command in MATLAB: 


>> [V,score,s2] = pca(X); 
>> norm(score*V' - B)
ans = 

1.4900e-13 


Example: Ovarian Cancer Data 


The ovarian cancer data set, which is built into MATLAB, provides a more re- 
alistic example to illustrate the benefits of PCA. This example consists of gene 
data for 216 patients, 121 of whom have ovarian cancer, and 95 of whom do 
not. For each patient, there is a vector of data containing the expression of 4000 
genes. There are multiple challenges with this type of data, namely the high
dimension of the data features. However, we see from Fig. 1.14 that there is
significant variance captured in the first few PCA modes. Said another way, the
gene data is highly correlated, so that many patients have significant overlap 
in their gene expression. The ability to visualize patterns and correlations in 
high-dimensional data is an important reason to use PCA, and PCA has been 
widely used to find patterns in high-dimensional biological and genetic data 
[588]. 

More importantly, patients with ovarian cancer appear to cluster separately
from patients without cancer when plotted in the space spanned by the first
three PCA modes. This is shown in Fig. 1.15, which is generated by Code 1.5.
This inherent clustering in PCA space of data by category is a foundational
element of machine learning and pattern recognition. For example, we will see
in Section 1.6 that images of different human faces will form clusters in PCA
space. The use of these clusters will be explored in greater detail in Chapter 5.


Code 1.5: [MATLAB] Compute PCA for ovarian cancer data.


load ovariancancer; % Load ovarian cancer data
[U,S,V] = svd(obs,'econ');
for i=1:size(obs,1)
    x = V(:,1)'*obs(i,:)';   % Coordinate along PCA mode 1
    y = V(:,2)'*obs(i,:)';   % Coordinate along PCA mode 2
    z = V(:,3)'*obs(i,:)';   % Coordinate along PCA mode 3
    if(strcmp(grp{i},'Cancer'))
        plot3(x,y,z,'rx','LineWidth',2);
    else
        plot3(x,y,z,'bo','LineWidth',2);
    end
    hold on   % Keep earlier points on the axes
end


Figure 1.14: Singular values for the ovarian cancer data.


Figure 1.15: Clustering of samples that are normal and those that have cancer
in the first three principal component coordinates.

Code 1.5: [Python] Compute PCA for ovarian cancer data.


import os
import numpy as np
import matplotlib.pyplot as plt

obs = np.loadtxt(os.path.join('..','DATA','ovariancancer_obs.csv'),delimiter=',')
with open(os.path.join('..','DATA','ovariancancer_grp.csv'), "r") as f:
    grp = f.read().split("\n")

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')   # 3D axes for the scatter plot

U, S, VT = np.linalg.svd(obs,full_matrices=0)
for j in range(obs.shape[0]):
    x = VT[0,:] @ obs[j,:].T   # Coordinate along PCA mode 1
    y = VT[1,:] @ obs[j,:].T   # Coordinate along PCA mode 2
    z = VT[2,:] @ obs[j,:].T   # Coordinate along PCA mode 3

    if grp[j] == 'Cancer':
        ax.scatter(x,y,z,marker='x',color='r',s=50)
    else:
        ax.scatter(x,y,z,marker='o',color='b',s=50)


1.6 Eigenfaces Example 


One of the most striking demonstrations of SVD/PCA is the so-called
eigenfaces example. In this problem, PCA (i.e., SVD on mean-subtracted data) is
applied to a large library of facial images to extract the most dominant
correlations between images. The result of this decomposition is a set of eigenfaces
that define a new coordinate system. Images may be represented in these
coordinates by taking the dot product with each of the principal components. It
will be shown in Chapter 5 that images of the same person tend to cluster in
the eigenface space, making this a useful transformation for facial recognition
and classification [67,687]. The eigenface problem was first studied by Sirovich
and Kirby in 1987 and expanded on in [388]. Its application to automated
facial recognition was presented by Turk and Pentland in 1991 [728].

Here, we demonstrate this algorithm using the Extended Yale Face Database
B [274], consisting of cropped and aligned images of 38 individuals (28
from the extended database, and 10 from the original database) under nine
poses and 64 lighting conditions.¹¹ Each image is 192 pixels tall and 168 pixels
wide. Unlike the previous image example, each of the facial images in our
library has been reshaped into a large column vector with
192 × 168 = 32256 elements. We use the first 36 people in the database (left
panel of Fig. 1.16) as our training data for the eigenfaces example, and we hold
back two people as a test set. An example of all 64 images of one specific
person is shown in the right panel. These images are loaded and plotted using
Code 1.6.


¹¹ The Yale database can be downloaded at
http://vision.ucsd.edu/~iskwak/ExtYaleDatabase/ExtYaleB.html


Figure 1.16: (left) A single image for each person in the Yale database, and
(right) all images for a specific person. Left panel generated by Code 1.6.


Code 1.6: [MATLAB] Plot image for each person in the Yale database (Fig. 1.16).


load ../DATA/allFaces.mat
allPersons = zeros(n*6,m*6);   % Make array to fit all faces
count = 1;
for i=1:6                      % 6 x 6 grid of faces
    for j=1:6
        allPersons(1+(i-1)*n:i*n,1+(j-1)*m:j*m) ...
            =reshape(faces(:,1+sum(nfaces(1:count-1))),n,m);
        count = count + 1;
    end
end
imagesc(allPersons), colormap gray


Code 1.6: [Python] Plot image for each person in the Yale database (Fig. 1.16).


import os
import numpy as np
import scipy.io
import matplotlib.pyplot as plt

mat_contents = scipy.io.loadmat(os.path.join('..','DATA','allFaces.mat'))
faces = mat_contents['faces']
m = int(mat_contents['m'].item())
n = int(mat_contents['n'].item())
nfaces = np.ndarray.flatten(mat_contents['nfaces'])

allPersons = np.zeros((n*6,m*6))   # Array to fit a 6 x 6 grid of faces
count = 0
for j in range(6):
    for k in range(6):
        # First image of person number count, reshaped back to n x m
        allPersons[j*n : (j+1)*n, k*m : (k+1)*m] = np.reshape(
            faces[:,np.sum(nfaces[:count])],(m,n)).T
        count += 1

img = plt.imshow(allPersons)


As mentioned before, each image is reshaped into a large column vector,
and the average face is computed and subtracted from each column vector.
The mean-subtracted image vectors are then stacked horizontally as columns
in the data matrix X, as shown in Fig. 1.17. Thus, taking the SVD of the
mean-subtracted matrix X results in the PCA. The columns of U are the eigenfaces,
and they may be reshaped back into 192 × 168 images. This is illustrated in
Code 1.7.


Code 1.7: [MATLAB] Compute eigenfaces on mean-subtracted data.


% We use the first 36 people for training data
trainingFaces = faces(:,1:sum(nfaces(1:36)));
avgFace = mean(trainingFaces,2);   % size n*m by 1;

% Compute eigenfaces on mean-subtracted training data
X = trainingFaces-avgFace*ones(1,size(trainingFaces,2));
[U,S,V] = svd(X,'econ');

imagesc(reshape(avgFace,n,m))      % Plot avg face
imagesc(reshape(U(:,1),n,m))       % Plot first eigenface
a 


Code 1.7: [Python] Compute eigenfaces on mean-subtracted data.


# We use the first 36 people for training data
trainingFaces = faces[:,:np.sum(nfaces[:36])]
avgFace = np.mean(trainingFaces,axis=1) # size n*m by 1

# Compute eigenfaces on mean-subtracted training data
X = trainingFaces - np.tile(avgFace,(trainingFaces.shape[1],1)).T
U, S, VT = np.linalg.svd(X,full_matrices=0)

fig, (ax1, ax2) = plt.subplots(1, 2)   # Axes for average face and first eigenface
img_avg = ax1.imshow(np.reshape(avgFace,(m,n)).T)
img_u1 = ax2.imshow(np.reshape(U[:,0],(m,n)).T)


Figure 1.17: Schematic procedure to obtain eigenfaces from library of faces X
after subtracting off average face $\bar{X}$. (Panels: mean-subtracted faces for
Person 1, Person 2, Person 3, ..., Person k; average face; singular values $\sigma_k$
versus r; eigenfaces.)


Using the eigenfaces library, $\tilde{U}$, obtained above, we now attempt to
approximately represent an image that was not in the training data. At the beginning,
we held back two individuals (the 37th and 38th people), and we now use one
of their images as a test image, $x_{\text{test}}$. We will see how well a rank-r SVD basis
will approximate this image using the following projection:


$\tilde{x}_{\text{test}} = \tilde{U} \tilde{U}^* x_{\text{test}}.$


The eigenface approximation for various values of r is shown in Fig. 1.18, as
computed using Code 1.8. The approximation is relatively poor for r < 200,
although for r > 400 it converges to a passable representation of the test image.

It is interesting to note that the eigenface space is not only useful for
representing human faces, but may also be used to approximate a dog (Fig. 1.19)
or a cappuccino (Fig. 1.20). This is possible because the 1600 eigenfaces span a
large subspace of the 32 256-dimensional image space corresponding to broad,
smooth, non-localized spatial features, such as cheeks, forehead, mouths, etc.
Figure 1.18: Approximate representation of test image using eigenfaces basis of
various order r. Test image is not in training set. (Panels include the test image
and approximations for r = 50, 100, 400, 800.)


Code 1.8: [MATLAB] Approximate test image omitted from training data.


testFace = faces(:,1+sum(nfaces(1:36)));   % Person 37
testFaceMS = testFace - avgFace;
for r=[25 50 100 200 400 800 1600]
    reconFace = avgFace + (U(:,1:r)*(U(:,1:r)'*testFaceMS));
    imagesc(reshape(reconFace,n,m))
end


Code 1.8: [Python] Approximate test image omitted from training data.


testFace = faces[:,np.sum(nfaces[:36])] # Person 37
testFaceMS = testFace - avgFace
r_list = [25, 50, 100, 200, 400, 800, 1600]
for r in r_list:
    reconFace = avgFace + U[:,:r] @ U[:,:r].T @ testFaceMS
    img = plt.imshow(np.reshape(reconFace,(m,n)).T)


We further investigate the use of the eigenfaces as a coordinate system,
defining an eigenface space. By projecting an image x onto the first r PCA
modes, we obtain a set of coordinates in this space: $\tilde{x} = \tilde{U}^* x$. Some principal
components may capture the most common features shared among all human
faces, while other principal components will be more useful for distinguishing
between individuals. Additional principal components may capture
differences in lighting angles. Figure 1.21 shows the coordinates of all 64 images of
two individuals projected onto the 5th and 6th principal components,
generated by Code 1.9. Images of the two individuals appear to be well separated in
these coordinates. This is the basis for image recognition and classification in
Chapter 5.


Figure 1.19: Approximate representation of an image of a dog using eigenfaces.


Figure 1.20: Approximate representation of a cappuccino using eigenfaces.


Figure 1.21: Projection of all images from two individuals onto the 5th and 6th
PCA modes. Projected images of the first individual are indicated with black
diamonds, and projected images of the second individual are indicated with
red triangles. Three examples from each individual are circled in blue, and the
corresponding image is shown.


Code 1.9: [MATLAB] Project images for two specific people onto the 5th and
6th eigenfaces to illustrate the potential for automated classification.


P1num = 2; % Person number 2
P2num = 7; % Person number 7
P1 = faces(:,1+sum(nfaces(1:P1num-1)):sum(nfaces(1:P1num)));
P2 = faces(:,1+sum(nfaces(1:P2num-1)):sum(nfaces(1:P2num)));
P1 = P1 - avgFace*ones(1,size(P1,2)); % Subtract the average face
P2 = P2 - avgFace*ones(1,size(P2,2));

PCAmodes = [5 6]; % Project onto PCA modes 5 and 6
PCACoordsP1 = U(:,PCAmodes)'*P1;
PCACoordsP2 = U(:,PCAmodes)'*P2;

plot(PCACoordsP1(1,:),PCACoordsP1(2,:),'kd'), hold on
plot(PCACoordsP2(1,:),PCACoordsP2(2,:),'r^')




Code 1.9: [Python] Project images for two specific people onto the 5th and 6th
eigenfaces to illustrate the potential for automated classification.


P1num = 2 # Person number 2
P2num = 7 # Person number 7
P1 = faces[:,np.sum(nfaces[:(P1num-1)]):np.sum(nfaces[:P1num])]
P2 = faces[:,np.sum(nfaces[:(P2num-1)]):np.sum(nfaces[:P2num])]
P1 = P1 - np.tile(avgFace,(P1.shape[1],1)).T # Subtract the average face
P2 = P2 - np.tile(avgFace,(P2.shape[1],1)).T

PCAmodes = [5, 6] # Project onto PCA modes 5 and 6
PCACoordsP1 = U[:,PCAmodes-np.ones_like(PCAmodes)].T @ P1
PCACoordsP2 = U[:,PCAmodes-np.ones_like(PCAmodes)].T @ P2

plt.plot(PCACoordsP1[0,:],PCACoordsP1[1,:],'d',color='k')
plt.plot(PCACoordsP2[0,:],PCACoordsP2[1,:],'^',color='r')


1.7 Truncation and Alignment 


Deciding how many singular values to keep, i.e., where to truncate, is one
of the most important and contentious decisions when using the SVD. There
are many factors, including specifications on the desired rank of the system,
the magnitude of noise, and the distribution of the singular values. Often, one
truncates the SVD at a rank r that captures a predetermined amount of the
variance or energy in the original data, such as 90% or 99% truncation. Although
crude, this technique is commonly used. Other techniques involve identifying
“elbows” or “knees” in the singular value distribution, which may denote the
transition from singular values that represent important patterns to those
that represent noise. Truncation may be viewed as a hard threshold on singular
values, where values larger than a threshold τ are kept, while the remaining
singular values are truncated. Recent work by Gavish and Donoho [267] provides an
optimal truncation value, or hard threshold, under certain conditions, providing
a principled approach to obtaining low-rank matrix approximations using
the SVD.
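As a small illustration of energy-based truncation, the following sketch returns the smallest rank r whose leading singular values capture a given fraction of the total. The function name `energy_rank` and the sample singular values are hypothetical, and we assume here that "energy" is measured by the squared singular values; using the plain cumulative sum of singular values is an equally common convention.

```python
import numpy as np

def energy_rank(s, fraction=0.90):
    """Smallest r such that the first r squared singular values
    hold at least `fraction` of the total (assumed energy measure)."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    # searchsorted returns the first index where energy >= fraction
    return int(np.searchsorted(energy, fraction) + 1)

# Illustrative (made-up) singular values, sorted in decreasing order
s = np.array([10.0, 5.0, 1.0, 0.5, 0.1])
r90 = energy_rank(s, 0.90)
r999 = energy_rank(s, 0.999)
```

The choice of fraction remains a judgment call, which is what motivates the more principled threshold discussed next.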

In addition, the alignment of data significantly impacts the rank of the SVD 
approximation. The SVD essentially relies on a separation of variables between 
the columns and rows of a data matrix. In many situations, such as when
analyzing traveling waves or misaligned data, this assumption breaks down,
resulting in an artificial rank inflation.


Optimal Hard Threshold


A recent theoretical breakthrough determines the optimal hard threshold τ for
singular value truncation under the assumption that a matrix has a low-rank
structure contaminated with Gaussian white noise [267]. This work builds on a
significant literature surrounding various techniques for hard and soft
thresholding of singular values. In this section, we summarize the main results and
demonstrate the thresholding on various examples. For more details, see [267].

First, we assume that the data matrix X is the sum of an underlying low-rank,
or approximately low-rank, matrix X_true and a noise matrix X_noise:


$$\mathbf{X} = \mathbf{X}_{\text{true}} + \gamma\,\mathbf{X}_{\text{noise}}. \tag{1.43}$$


The entries of X_noise are assumed to be independent, identically distributed
(i.i.d.) Gaussian random variables with zero mean and unit variance. The
magnitude of the noise is characterized by γ, which deviates from the notation in
[267].⁸

When the noise magnitude γ is known, there are closed-form solutions for
the optimal hard threshold τ:


1. If $\mathbf{X} \in \mathbb{R}^{n\times n}$ is square, then


$$\tau = (4/\sqrt{3})\sqrt{n}\,\gamma. \tag{1.44}$$


2. If $\mathbf{X} \in \mathbb{R}^{n\times m}$ is rectangular and m < n, then the constant $4/\sqrt{3}$ is replaced
by a function of the aspect ratio β = m/n:


$$\tau = \lambda(\beta)\sqrt{n}\,\gamma, \tag{1.45}$$


$$\lambda(\beta) = \left(2(\beta+1) + \frac{8\beta}{(\beta+1) + (\beta^{2}+14\beta+1)^{1/2}}\right)^{1/2}. \tag{1.46}$$

Note that this expression reduces to (1.44) when β = 1. If n < m, then
β = n/m.
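The closed-form thresholds (1.44)–(1.46) are simple to evaluate. The sketch below uses hypothetical helper names (`lambda_beta`, `hard_threshold_known_noise`) and assumes the noise magnitude γ is supplied by the user; it always forms β from the smaller dimension divided by the larger one, as described above.

```python
import math

def lambda_beta(beta):
    """Coefficient lambda(beta) from (1.46), for 0 < beta <= 1."""
    return math.sqrt(
        2 * (beta + 1)
        + 8 * beta / ((beta + 1) + math.sqrt(beta**2 + 14 * beta + 1))
    )

def hard_threshold_known_noise(n_rows, n_cols, gamma):
    """Threshold tau from (1.45) for a known noise magnitude gamma."""
    small, large = sorted((n_rows, n_cols))
    beta = small / large              # aspect ratio, at most 1
    return lambda_beta(beta) * math.sqrt(large) * gamma

# Square 600 x 600 matrix with unit noise: reduces to (4/sqrt(3)) sqrt(n)
tau_square = hard_threshold_known_noise(600, 600, 1.0)
```

For a square matrix, `lambda_beta(1.0)` equals 4/√3, so the rectangular formula contains (1.44) as a special case.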


When the noise magnitude γ is unknown, which is more typical in real-world
applications, then it is possible to estimate the noise magnitude and scale
the distribution of singular values by using σ_med, the median singular value. In
this case, there is no closed-form solution for τ, and it must be approximated
numerically:


3. For unknown noise γ, and a rectangular matrix $\mathbf{X} \in \mathbb{R}^{n\times m}$, the optimal
hard threshold is given by


$$\tau = \omega(\beta)\,\sigma_{\text{med}}. \tag{1.47}$$


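Here, $\omega(\beta) = \lambda(\beta)/\sqrt{\mu_\beta}$, where $\mu_\beta$ is the median of the Marchenko-Pastur
distribution with aspect ratio β, i.e., the solution to the following problem:


$$\int_{(1-\sqrt{\beta})^{2}}^{\mu_\beta} \frac{\left[\left((1+\sqrt{\beta})^{2}-t\right)\left(t-(1-\sqrt{\beta})^{2}\right)\right]^{1/2}}{2\pi\beta t}\,dt = \frac{1}{2}.$$


Solutions to the expression above must be approximated numerically.
Fortunately, [267] has a MATLAB code supplement⁹ [206] to approximate $\mu_\beta$.


⁸ In [267], σ is used to denote standard deviation and $y_k$ denotes the kth singular value.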
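For readers without access to the MATLAB supplement, the coefficient ω(β) can be approximated with a few lines of NumPy. This is a rough sketch, not the reference implementation: the function names are hypothetical, and it assumes that ω(β) = λ(β)/√μ_β with μ_β the median of the Marchenko-Pastur distribution. To avoid the endpoint singularity when β = 1, the integral is evaluated in the singular value variable s = √t with a simple midpoint rule, so the computed median is √μ_β directly.

```python
import numpy as np

def lambda_known(beta):
    """Known-noise coefficient lambda(beta), as in (1.46)."""
    return np.sqrt(2 * (beta + 1)
                   + 8 * beta / ((beta + 1) + np.sqrt(beta**2 + 14 * beta + 1)))

def omega_unknown(beta, n_grid=200_000):
    """Approximate omega(beta) = lambda(beta) / sqrt(mu_beta) numerically."""
    lo, hi = 1 - np.sqrt(beta), 1 + np.sqrt(beta)   # support in s = sqrt(t)
    edges = np.linspace(lo, hi, n_grid + 1)
    s = 0.5 * (edges[:-1] + edges[1:])              # cell midpoints
    # Marchenko-Pastur density expressed in the variable s
    dens = np.sqrt((hi**2 - s**2) * (s**2 - lo**2)) / (np.pi * beta * s)
    cdf = np.cumsum(dens) * (edges[1] - edges[0])
    cdf /= cdf[-1]                                  # remove quadrature bias
    s_med = np.interp(0.5, cdf, edges[1:])          # equals sqrt(mu_beta)
    return lambda_known(beta) / s_med

omega_square = omega_unknown(1.0)   # close to the value 2.858 quoted in [267]
```

Given a vector of singular values `S`, the threshold (1.47) would then be `omega_unknown(beta) * np.median(S)`.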


The new method of optimal hard thresholding works remarkably well, as 
demonstrated on the examples below. 


Example 1: Toy Problem 


In the first example, shown in Fig. 1.22 and Code 1.10, we artificially construct
a rank-two matrix and contaminate the data with Gaussian white noise. A
denoised and dimensionally reduced matrix is then obtained using the threshold
from (1.44), as well as a truncation keeping 90% of the cumulative sum of
singular values. It is clear that the hard threshold is able to filter the noise more
effectively. Plotting the singular values in Fig. 1.23, it is clear that there are two
values that are above threshold.


Code 1.10: [MATLAB] Compare various thresholding approaches on noisy
low-rank data (Fig. 1.22).


% Generate underlying low-rank data
t = (-3:.01:3)';
Utrue = [cos(17*t).*exp(-t.^2) sin(11*t)];
Strue = [2 0; 0 .5];
Vtrue = [sin(5*t).*exp(-t.^2) cos(13*t)];
X = Utrue*Strue*Vtrue';
figure, imshow(X);

% Contaminate signal with noise
sigma = 1;
Xnoisy = X+sigma*randn(size(X));
figure, imshow(Xnoisy);

% Truncate using optimal hard threshold
[U,S,V] = svd(Xnoisy);
N = size(Xnoisy,1);
cutoff = (4/sqrt(3))*sqrt(N)*sigma; % Hard threshold
r = max(find(diag(S)>cutoff)); % Keep modes w/ sig > cutoff
Xclean = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';
figure, imshow(Xclean)

% Truncate to keep 90% of cumulative sum
cdS = cumsum(diag(S))./sum(diag(S)); % Cumulative sum
r90 = min(find(cdS>0.90)); % Find r to capture 90% of sum
X90 = U(:,1:r90)*S(1:r90,1:r90)*V(:,1:r90)';
figure, imshow(X90)


⁹ http://purl.stanford.edu/vg705qn9070


Code 1.10: [Python] Compare various thresholding approaches on noisy
low-rank data (Fig. 1.22).


# Generate underlying low-rank data
t = np.arange(-3,3,0.01)

Utrue = np.array([np.cos(17*t) * np.exp(-t**2), np.sin(11*t)]).T
Strue = np.array([[2, 0],[0, 0.5]])
Vtrue = np.array([np.sin(5*t) * np.exp(-t**2), np.cos(13*t)]).T

X = Utrue @ Strue @ Vtrue.T
plt.imshow(X)

# Contaminate signal with noise
sigma = 1
Xnoisy = X + sigma*np.random.randn(*X.shape)
plt.imshow(Xnoisy)

# Truncate using optimal hard threshold
U, S, VT = np.linalg.svd(Xnoisy,full_matrices=0)
N = Xnoisy.shape[0]
cutoff = (4/np.sqrt(3)) * np.sqrt(N) * sigma # Hard threshold
r = np.max(np.where(S > cutoff)) # Keep modes w/ S > cutoff (zero-based index)

Xclean = U[:,:(r+1)] @ np.diag(S[:(r+1)]) @ VT[:(r+1),:]
plt.imshow(Xclean)

# Truncate to keep 90% of cumulative sum
cdS = np.cumsum(S) / np.sum(S) # Cumulative sum of singular values
r90 = np.min(np.where(cdS > 0.90)) # Find r to keep 90% sum

X90 = U[:,:(r90+1)] @ np.diag(S[:(r90+1)]) @ VT[:(r90+1),:]
plt.imshow(X90)


Figure 1.22: (a) Underlying rank-two matrix, (b) matrix with noise, (c) clean
matrix after optimal hard threshold $(4/\sqrt{3})\sqrt{n}\,\sigma$, and (d) truncation keeping
90% of the cumulative sum of singular values.


Example 2: Eigenfaces 


In the second example, we revisit the eigenfaces problem from Section 1.6. This
provides a more typical example, since the data matrix X is rectangular, with
aspect ratio β = 3/4, and the noise magnitude is unknown. It is also not clear
that the data is contaminated with white noise. Nonetheless, the method
determines a threshold τ, above which columns of U appear to have strong
facial features, and below which columns of U consist mostly of noise, shown in
Fig. 1.24.


Importance of Data Alignment 


Here, we discuss common pitfalls of the SVD associated with misaligned data.
The following example is designed to illustrate one of the central weaknesses of
the SVD for dimensionality reduction and coherent feature extraction in data.
Consider a matrix of zeros with a rectangular sub-block consisting of ones. As
an image, this would look like a white rectangle placed on a black background
(see Fig. 1.25(a)). If the rectangle is perfectly aligned with the x- and y-axes of
the figure, then the SVD is simple, having only one non-zero singular value σ_1
(see Fig. 1.25(c)) and corresponding singular vectors u_1 and v_1 that define the
width and height of the white rectangle.


Figure 1.23: (a) Singular values σ_r and (b) cumulative sum in first r modes. The
optimal hard threshold $\tau = (4/\sqrt{3})\sqrt{n}\,\sigma$ is shown as a red dashed line, and the
90% cutoff is shown as a blue dashed line. For this case, n = 600 and σ = 1, so
that the optimal cutoff is approximately τ = 56.6.


Figure 1.24: Hard thresholding for eigenfaces example.


Figure 1.25: A data matrix consisting of zeros with a square sub-block of ones
(a), and its SVD spectrum (c). If we rotate the image by 10°, as in (b), the SVD
spectrum becomes significantly more complex (d).


When we begin to rotate the inner rectangle so that it is no longer aligned
with the image axes, additional non-zero singular values begin to appear in
the spectrum (see Figs. 1.25(b,d) and 1.26). Code to reproduce this example is
provided on the book’s GitHub.
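The effect is easy to reproduce on a small synthetic image. The following is a minimal sketch rather than the book's GitHub code: the helper names, the 200 × 200 image size, and the rank tolerance are all assumptions, and the rotated square is drawn by rotating pixel coordinates instead of using an image-rotation routine.

```python
import numpy as np

def rotated_square(n=200, half_width=40, angle_deg=0.0):
    """n x n image of zeros with a centered square of ones, rotated by angle_deg."""
    c = (n - 1) / 2
    y, x = np.mgrid[0:n, 0:n] - c              # pixel coordinates about the center
    th = np.deg2rad(angle_deg)
    xr = np.cos(th) * x + np.sin(th) * y       # coordinates in the rotated frame
    yr = -np.sin(th) * x + np.cos(th) * y
    inside = (np.abs(xr) <= half_width) & (np.abs(yr) <= half_width)
    return inside.astype(float)

def numerical_rank(X, tol=1e-10):
    """Number of singular values above tol times the largest one."""
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rank_aligned = numerical_rank(rotated_square(angle_deg=0.0))
rank_rotated = numerical_rank(rotated_square(angle_deg=10.0))
```

The aligned square is an outer product of two indicator vectors and therefore has rank one, whereas the rotated square requires many more modes.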


The reason that this example breaks down is that the SVD is fundamentally 
geometric, meaning that it depends on the coordinate system in which the data 
is represented. As we have seen earlier, the SVD is only generically invariant 
to unitary transformations, meaning that the transformation preserves the in- 
ner product. This fact may be viewed as both a strength and a weakness of the 
method. First, the dependence of SVD on the inner product is essential for the 
various useful geometric interpretations. Moreover, the SVD has meaningful 
units and dimensions. However, this makes the SVD sensitive to the alignment 
of the data. In fact, the SVD rank explodes when objects in the columns
translate, rotate, or scale, which severely limits its use for data that has not been
heavily pre-processed.


Figure 1.26: A data matrix consisting of zeros with a square sub-block of ones
at various rotations (a), and the corresponding SVD spectrum, diag(S), (b).

For instance, the eigenfaces example was built on a library of images that 
had been meticulously cropped, centered, and aligned according to a stencil. 
Without taking these important pre-processing steps, the features and cluster- 
ing performance would be underwhelming. 

The inability of the SVD to capture translations and rotations of the data is a
major limitation. For example, the SVD is still the method of choice for the
low-rank decomposition of data from partial differential equations (PDEs), as will
be explored in Chapters 12 and 13. However, the SVD is fundamentally a
data-driven separation of variables, which we know will not work for many types
of PDE, for example, those that exhibit traveling waves. Developing generalized
decompositions that retain the favorable properties of the SVD and are applicable
to data with symmetries is a significant open challenge in the field.
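To make the traveling-wave limitation concrete, the sketch below compares a separable standing wave with a translating Gaussian pulse. The grid sizes, the pulse shape, and the 99% energy criterion are illustrative assumptions chosen for this example.

```python
import numpy as np

x = np.linspace(-10, 10, 400)
t = np.linspace(0, 10, 100)
Xg, Tg = np.meshgrid(x, t, indexing="ij")     # rows index space, columns index time

standing = np.exp(-Xg**2) * np.cos(2 * Tg)    # separable: f(x) g(t)
traveling = np.exp(-(Xg + 5 - Tg)**2)         # pulse translating at unit speed

def rank_for_energy(X, fraction=0.99):
    """Smallest r whose squared singular values capture `fraction` of the total."""
    s = np.linalg.svd(X, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, fraction) + 1)

r_standing = rank_for_energy(standing)
r_traveling = rank_for_energy(traveling)
```

The standing wave is captured by a single mode, while the translating pulse, which has exactly the same shape at every instant, needs many modes to reach the same energy level.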


1.8 Randomized Singular Value Decomposition 


The accurate and efficient decomposition of large data matrices is one of the 
cornerstones of modern computational mathematics and data science. In many 
cases, matrix decompositions are explicitly focused on extracting dominant 
low-rank structure in the matrix, as illustrated throughout the examples in this 
chapter. Recently, it has been shown that if a matrix X has low-rank structure, 
then there are extremely efficient matrix decomposition algorithms based on 
the theory of random sampling; this is closely related to the idea of sparsity
and the high-dimensional geometry of sparse vectors, which will be explored in
a later chapter. These so-called randomized numerical methods have the potential to
transform computational linear algebra, providing accurate matrix
decompositions at a fraction of the cost of deterministic methods. Moreover, with
increasingly vast measurements (e.g., from 4K and 8K video, Internet of Things, etc.), it
is often the case that the intrinsic rank of the data does not increase appreciably, 
even though the dimension of the ambient measurement space grows. Thus, 
the computational savings of randomized methods will only become more im- 
portant in the coming years and decades with the growing deluge of data. 


Randomized Linear Algebra 


Randomized linear algebra is a much more general concept than the treatment
presented here for the SVD. In addition to the randomized SVD [488, 621],
randomized algorithms have been developed for principal component analysis
[308, 605], the pivoted LU decomposition [651], the pivoted QR decomposition
[219], and the dynamic mode decomposition [234]. Most randomized matrix
decompositions can be broken into a few common steps, as described here.
There are also several excellent surveys on the topic [236, 309, 445, 471]. We
assume that we are working with tall-skinny matrices, so that n > m, although
the theory readily generalizes to short-fat matrices.


Step 0: Identify a target rank, r < m.


Step 1: Using random projections P to sample the column space, find a
matrix Q whose columns approximate the column space of X, i.e., so that
$\mathbf{X} \approx \mathbf{Q}\mathbf{Q}^{*}\mathbf{X}$.


Step 2: Project X onto the Q subspace, $\mathbf{Y} = \mathbf{Q}^{*}\mathbf{X}$, and compute the matrix
decomposition on Y.


Step 3: Reconstruct high-dimensional modes $\mathbf{U} = \mathbf{Q}\mathbf{U}_{\mathbf{Y}}$ using Q and the
modes computed from Y.


Randomized SVD Algorithm 


Over the past two decades, there have been several randomized algorithms
proposed to compute a low-rank SVD, including the Monte Carlo SVD
and more robust approaches based on random projections [446, 488, 621]. These
methods were improved by incorporating structured sampling matrices for
faster matrix multiplications [761]. Here, we use the randomized SVD
algorithm of Halko, Martinsson, and Tropp [309], which combined and expanded
on these previous algorithms, providing favorable error bounds. Additional
analysis and numerical implementation details are found in Voronin and
Martinsson [739]. A schematic of the rSVD algorithm is shown in Fig. 1.27.


Figure 1.27: Schematic of randomized SVD algorithm. The high-dimensional
data X is depicted in red, intermediate steps in gray, and the outputs in blue.
This algorithm requires two passes over X.


Step 1. We construct a random projection matrix $\mathbf{P} \in \mathbb{R}^{m\times r}$ to sample the
column space of $\mathbf{X} \in \mathbb{R}^{n\times m}$:

$$\mathbf{Z} = \mathbf{X}\mathbf{P}. \tag{1.48}$$


The matrix Z may be much smaller than X, especially for low-rank matrices
with r ≪ m. It is highly unlikely that a random projection matrix P will project
out important components of X, and so Z approximates the column space of X
with high probability. Thus, it is possible to compute the low-rank QR
decomposition of Z to obtain an orthonormal basis for X:


$$\mathbf{Z} = \mathbf{Q}\mathbf{R}. \tag{1.49}$$


Step 2. With the low-rank basis Q, we may project X into a smaller space:


$$\mathbf{Y} = \mathbf{Q}^{*}\mathbf{X}. \tag{1.50}$$

It also follows that $\mathbf{X} \approx \mathbf{Q}\mathbf{Y}$, with better agreement when the singular values
σ_k decay rapidly for k > r.
It is now possible to compute the singular value decomposition on Y:


$$\mathbf{Y} = \mathbf{U}_{\mathbf{Y}}\boldsymbol{\Sigma}\mathbf{V}^{*}. \tag{1.51}$$


Because Q has orthonormal columns and approximates the column space of X, the
matrices Σ and V are the same for Y and X, as discussed in Section 1.3.


Step 3. Finally, it is possible to reconstruct the high-dimensional left singular
vectors U using U_Y and Q:

$$\mathbf{U} = \mathbf{Q}\mathbf{U}_{\mathbf{Y}}. \tag{1.52}$$
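Steps 1 through 3 can be checked on a synthetic matrix that is exactly rank r, for which the sketch Z spans the column space of X with probability one. The sizes and the random seed below are arbitrary assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 200, 50, 5
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # exactly rank r

# Step 1: sample the column space and orthonormalize, as in (1.48)-(1.49)
P = rng.standard_normal((m, r))
Z = X @ P
Q, _ = np.linalg.qr(Z, mode="reduced")

# Step 2: project and decompose the small matrix, as in (1.50)-(1.51)
Y = Q.T @ X
UY, S, VT = np.linalg.svd(Y, full_matrices=False)

# Step 3: lift the left singular vectors back, as in (1.52)
U = Q @ UY

S_exact = np.linalg.svd(X, compute_uv=False)[:r]
rel_err = np.linalg.norm(X - U @ np.diag(S) @ VT) / np.linalg.norm(X)
```

In this idealized setting the singular values of Y agree with those of X to rounding error; for matrices that are only approximately low rank, the agreement depends on how quickly the singular values decay.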


Oversampling 


Most matrices X do not have an exact low-rank structure, given by r modes.
Instead, there are non-zero singular values σ_k for k > r, and the sketch Z will
not exactly span the column space of X. In general, increasing the number of
columns in P from r to r + p significantly improves results, even when p adds
only around 5-10 columns [487]. This is known as oversampling, and increasing p
decreases the variance of the singular value spectrum of the sketched matrix.
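The benefit of oversampling can be seen on a test matrix with slowly decaying singular values. In this sketch, the matrix sizes, the 1/k spectrum, and the seed are assumptions; the sketches for different p reuse the leading columns of one random matrix, so that each subspace contains the previous one and the projection error cannot increase with p.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 300, 100, 10

# Test matrix with prescribed singular values sigma_k = 1/k
U0, _ = np.linalg.qr(rng.standard_normal((n, m)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
X = U0 @ np.diag(1.0 / np.arange(1, m + 1)) @ V0.T

P_full = rng.standard_normal((m, r + 10))      # nested random projections

def sketch_error(p):
    """Spectral-norm error of projecting X onto the span of an (r+p)-column sketch."""
    Q, _ = np.linalg.qr(X @ P_full[:, :r + p])
    return np.linalg.norm(X - Q @ (Q.T @ X), 2)

errors = {p: sketch_error(p) for p in (0, 5, 10)}
sigma_next = 1.0 / (r + 1)                     # sigma_{r+1}, the best rank-r error
```

Comparing the entries of `errors` with `sigma_next` shows how far each sketch is from the optimal rank-r approximation error.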


Power Iterations 


A second challenge in using randomized algorithms is when the singular value
spectrum decays slowly, so that the remaining truncated singular values
contain significant variance in the data X. In this case, it is possible to pre-process
X through q power iterations to create a new matrix $\mathbf{X}^{(q)}$ with a
more rapid singular value decay:


$$\mathbf{X}^{(q)} = (\mathbf{X}\mathbf{X}^{*})^{q}\mathbf{X}. \tag{1.53}$$


Power iterations dramatically improve the quality of the randomized
decomposition, as the singular value spectrum of $\mathbf{X}^{(q)}$ decays more rapidly:


$$\mathbf{X}^{(q)} = \mathbf{U}\boldsymbol{\Sigma}^{2q+1}\mathbf{V}^{*}. \tag{1.54}$$


However, power iterations are expensive, requiring q additional passes through 
the data X. In some extreme examples, the data in X may be stored in a dis- 
tributed architecture, so that every additional pass adds considerable expense. 
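The relation (1.54) can be verified numerically on a small random matrix. The sizes, the seed, and the choice q = 2 below are arbitrary; the loop applies XX* to the matrix q times without ever forming the n × n product explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 20))
q = 2

Xq = X.copy()
for _ in range(q):
    Xq = X @ (X.T @ Xq)          # one application of X X^*

s = np.linalg.svd(X, compute_uv=False)
sq = np.linalg.svd(Xq, compute_uv=False)
decay_before = s[-1] / s[0]      # ratio of smallest to largest singular value
decay_after = sq[-1] / sq[0]     # the same ratio after q power iterations
```

Each singular value is raised to the power 2q + 1, so the ratio between small and large singular values shrinks rapidly with q.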


Guaranteed Error Bounds 


One of the most important properties of the randomized SVD is the existence of
tunable error bounds, which are explicit functions of the singular value
spectrum, the desired rank r, the oversampling parameter p, and the number of
power iterations q. The best attainable error bound for a deterministic
algorithm is

$$\|\mathbf{X} - \mathbf{Q}\mathbf{Y}\|_{2} \geq \sigma_{r+1}(\mathbf{X}). \tag{1.55}$$

In other words, the approximation with the best possible rank-r subspace Q
will have error greater than or equal to the next truncated singular value of X.
For randomized methods, it is possible to bound the expectation of the error:


$$\mathbb{E}\left(\|\mathbf{X} - \mathbf{Q}\mathbf{Y}\|_{2}\right) \leq \left(1 + \sqrt{\frac{r}{p-1}} + \frac{e\sqrt{r+p}}{p}\sqrt{m-r}\right)^{\frac{1}{2q+1}} \sigma_{r+1}(\mathbf{X}), \tag{1.56}$$


where e is Euler’s number.
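To get a feel for how the parameters enter this bound, the sketch below evaluates the factor that multiplies σ_{r+1}(X). The function name is hypothetical, and the expression is assumed to take the form given by Halko, Martinsson, and Tropp, with m the smaller matrix dimension and p ≥ 2.

```python
import math

def rsvd_bound_factor(m, r, p, q):
    """Factor multiplying sigma_{r+1} in the expected-error bound (assumed form)."""
    if p < 2:
        raise ValueError("the bound requires an oversampling parameter p >= 2")
    base = (1
            + math.sqrt(r / (p - 1))
            + (math.e * math.sqrt(r + p) / p) * math.sqrt(m - r))
    return base ** (1 / (2 * q + 1))

# Illustrative parameters: m = 1000 columns, target rank 400, p = 5
factors = [rsvd_bound_factor(1000, 400, 5, q) for q in (0, 1, 2)]
```

Because the exponent is 1/(2q + 1), each additional power iteration pulls the factor closer to one, at the cost of additional passes over the data.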


Choice of Random Matrix P 


There are several suitable choices of the random matrix P. Gaussian random 
projections (e.g., the elements of P are i.i.d. Gaussian random variables) are 
frequently used because of favorable mathematical properties and the richness 
of information extracted in the sketch Z. In particular, it is very unlikely that a 
Gaussian random matrix P will be chosen badly so as to project out important 
information in X. However, Gaussian projections are expensive to generate, 
store, and compute. Uniform random matrices are also frequently used, and 
have similar limitations. There are several alternatives, such as Rademacher
matrices, where the entries can be +1 or −1 with equal probability [720].
Structured random projection matrices may provide efficient sketches, reducing
computational costs to O(nm log(r)) [761]. Yet another choice is a sparse projection
matrix P, which improves storage and computation, but at the cost of including
less information in the sketch. In the extreme case, when even a single pass over
the matrix X is prohibitively expensive, the matrix P may be chosen as random
columns of the m × m identity matrix, so that it randomly selects columns of X
for the sketch Z. This is the fastest option, but should be used with caution, as
information may be lost if the structure of X is highly localized in a subset of
columns that the random sampling misses.
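Three of these choices can be generated in a few lines of NumPy. The variable names, sizes, and seed are illustrative assumptions; structured and sparse projections are omitted because they depend on more specialized routines.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 100, 15                                   # k = r + p sketch columns

P_gauss = rng.standard_normal((m, k))            # i.i.d. Gaussian entries
P_rademacher = rng.choice([-1.0, 1.0], size=(m, k))   # +1 or -1 with equal probability

cols = rng.choice(m, size=k, replace=False)      # random column indices
P_columns = np.eye(m)[:, cols]                   # random columns of the identity

X = rng.standard_normal((200, m))
Z_columns = X @ P_columns                        # equivalent to selecting columns of X
```

In practice the column-sampling sketch would be formed directly as `X[:, cols]`, which avoids the matrix product altogether.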


Example of Randomized SVD 


To demonstrate the randomized SVD algorithm, we will decompose a
high-resolution image. This particular implementation is only for illustrative
purposes, as it has not been optimized for speed, data transfer, or accuracy. In
practical applications, care should be taken [236, 309].

Code 1.11 computes the randomized SVD of a matrix X, and Code 1.12 uses
this function to obtain a rank-400 approximation to a high-resolution image,
shown in Fig. 1.28.


Figure 1.28: Original high-resolution image (left) and rank-400 approximations from
the SVD (middle) and rSVD (right).


Code 1.11: [MATLAB] Randomized SVD algorithm.


function [U,S,V] = rsvd(X,r,q,p);

% Step 1: Sample column space of X with P matrix
ny = size(X,2);
P = randn(ny,r+p);
Z = X*P;
for k=1:q
    Z = X*(X'*Z);   % Power iterations
end
[Q,R] = qr(Z,0);

% Step 2: Compute SVD on projected Y=Q'*X;
Y = Q'*X;
[UY,S,V] = svd(Y,'econ');
U = Q*UY;


Code 1.11: [Python] Randomized SVD algorithm.


def rSVD(X,r,q,p):
    # Step 1: Sample column space of X with P matrix
    ny = X.shape[1]
    P = np.random.randn(ny,r+p)
    Z = X @ P
    for k in range(q):
        Z = X @ (X.T @ Z)   # Power iterations
    Q, R = np.linalg.qr(Z,mode='reduced')

    # Step 2: Compute SVD on projected Y = Q.T @ X
    Y = Q.T @ X
    UY, S, VT = np.linalg.svd(Y,full_matrices=0)
    U = Q @ UY

    return U, S, VT


Code 1.12: [MATLAB] Compute the randomized SVD of high-resolution image.


A=imread('jupiter.jpg');
X=double(rgb2gray(A));
[U,S,V] = svd(X,'econ');      % Deterministic SVD

r = 400;                      % Target rank
q = 1;                        % Power iterations
p = 5;                        % Oversampling parameter
[rU,rS,rV] = rsvd(X,r,q,p);   % Randomized SVD

%% Reconstruction
XSVD = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';       % SVD approx.
errSVD = norm(X-XSVD,2)/norm(X,2);
XrSVD = rU(:,1:r)*rS(1:r,1:r)*rV(:,1:r)';   % rSVD approx.
errrSVD = norm(X-XrSVD,2)/norm(X,2);


Code 1.12: [Python] Compute the randomized SVD of high-resolution image.


A = imread(os.path.join('..','DATA','jupiter.jpg'))
X = np.mean(A,axis=2) # Convert RGB -> grayscale
U, S, VT = np.linalg.svd(X,full_matrices=0) # Deterministic SVD

r = 400 # Target rank
q = 1   # Power iterations
p = 5   # Oversampling parameter
rU, rS, rVT = rSVD(X,r,q,p)

## Reconstruction (keep the first r modes for a rank-r approximation)
XSVD = U[:,:r] @ np.diag(S[:r]) @ VT[:r,:]       # SVD approx.
errSVD = np.linalg.norm(X-XSVD,ord=2)/np.linalg.norm(X,ord=2)
XrSVD = rU[:,:r] @ np.diag(rS[:r]) @ rVT[:r,:]   # rSVD approx.
errrSVD = np.linalg.norm(X-XrSVD,ord=2)/np.linalg.norm(X,ord=2)


1.9 Tensor Decompositions and N-Way Data Arrays


Low-rank decompositions can be generalized beyond matrices. This is impor- 
tant, as the SVD requires that disparate types of data be flattened into a single 
vector in order to evaluate correlated structures. For instance, different time 
snapshots (columns) of a matrix may include measurements as diverse as tem- 
perature, pressure, concentration of a substance, etc. Additionally, there may 
be categorical data. Vectorizing this data generally does not make sense. Ul- 
timately, what is desired is to preserve the various data structures and types 
in their own, independent directions. Matrices can be generalized to N-way ar- 
rays, or tensors, where the data is more appropriately arranged without forcing 
a data-flattening process. 

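The construction of data tensors requires that we revisit the notation
associated with tensor addition, multiplication, and inner products [401]. We denote
the rth column of a matrix A by $\mathbf{a}_r$. Given matrices $\mathbf{A} \in \mathbb{R}^{I\times K}$ and $\mathbf{B} \in \mathbb{R}^{J\times K}$,
their Khatri-Rao product is denoted by $\mathbf{A} \odot \mathbf{B}$ and is defined to be the IJ × K
matrix of column-wise Kronecker products, namely


$$\mathbf{A} \odot \mathbf{B} = \begin{pmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{pmatrix}.$$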
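The definition translates directly into NumPy. The function name `khatri_rao` below is our own, written so that the example does not depend on any particular library, and the small matrices are made up for illustration.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x K) and B (J x K), giving IJ x K."""
    I, K = A.shape
    J, K2 = B.shape
    if K != K2:
        raise ValueError("A and B must have the same number of columns")
    # Entry (i, j, k) is A[i, k] * B[j, k]; merging (i, j) gives row index i*J + j
    return np.einsum("ik,jk->ijk", A, B).reshape(I * J, K)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 5.0]])
AB = khatri_rao(A, B)
```

Each column of `AB` is the Kronecker product of the corresponding columns of `A` and `B`, which can be confirmed with `np.kron`.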


For an N-way tensor A of size $I_1 \times I_2 \times \cdots \times I_N$, we denote its $\mathbf{i} = (i_1, i_2, \ldots, i_N)$
entry by $a_{\mathbf{i}}$.

The inner product between two N-way tensors A and B of compatible
dimensions is given by


$$\langle \mathbf{A}, \mathbf{B} \rangle = \sum_{\mathbf{i}} a_{\mathbf{i}} b_{\mathbf{i}}.$$


The Frobenius norm of a tensor A, denoted by $\|\mathbf{A}\|_F$, is the square root of the
inner product of A with itself, namely $\|\mathbf{A}\|_F = \sqrt{\langle \mathbf{A}, \mathbf{A} \rangle}$. Finally, the mode-n
matricization or unfolding of a tensor A is denoted by $\mathbf{A}_{(n)}$.
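These operations are one-liners on NumPy arrays. The sketch below uses a random 3 × 4 × 5 tensor; the helper `unfold` is hypothetical, and its column ordering follows NumPy's row-major reshape, which is an assumption and may differ from the ordering used in some tensor references.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((3, 4, 5))

inner = np.sum(A * B)              # <A, B>: sum over all entries a_i * b_i
fro = np.sqrt(np.sum(A * A))       # ||A||_F = sqrt(<A, A>)

def unfold(T, mode):
    """Mode-n unfolding: rows indexed by `mode`, remaining modes flattened."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

A1 = unfold(A, 1)                  # 4 x 15 matrix
```

Unfolding only rearranges entries, so the Frobenius norm of `A1` equals that of `A`.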

Let M represent an N-way data tensor of size $I_1 \times I_2 \times \cdots \times I_N$. We are
interested in an R-component CANDECOMP/PARAFAC (CP)
factor model


$$\mathbf{M} = \sum_{r=1}^{R} \lambda_r\, \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(N)}, \tag{1.57}$$


where ∘ represents outer product and $\mathbf{a}_r^{(n)}$ represents the rth column of the
factor matrix $\mathbf{A}^{(n)}$ of size $I_n \times R$. The CP decomposition refers to canonical
decomposition (CANDECOMP) and parallel factors analysis (PARAFAC),
respectively. We refer to each summand as a component. Assuming each factor matrix
has been column-normalized to have unit Euclidean length, we refer to the $\lambda_r$
as weights. We will use the shorthand notation where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_R)^{T}$ [35].
A tensor that has a CP decomposition is sometimes referred to as a Kruskal
tensor.


Figure 1.29: Comparison of the SVD and tensor decomposition frameworks.
Both methods produce an approximation to the original data matrix by sums of
outer products. Specifically, the tensor decomposition generalizes the concept
of the SVD to N-way arrays of data without having to flatten (vectorize) the
data.

For the rest of this chapter, we consider a three-way CP tensor decomposition
(see Fig. 1.29), where two modes index state variation and the third mode
indexes time variation:


$$\mathbf{M} = \sum_{r=1}^{R} \lambda_r\, \mathbf{A}_r \circ \mathbf{B}_r \circ \mathbf{C}_r.$$


Let $\mathbf{A} \in \mathbb{R}^{I_1\times R}$ and $\mathbf{B} \in \mathbb{R}^{I_2\times R}$ denote the factor matrices corresponding to the
two state modes and $\mathbf{C} \in \mathbb{R}^{I_3\times R}$ denote the factor matrix corresponding to the
time mode. This three-way decomposition is compared to the SVD in Fig. 1.29.
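The three-way model can be assembled explicitly from its factor matrices. In this sketch the sizes, weights, and random factors are all made up for illustration; the `einsum` call forms the weighted sum of outer products in one step, and the loop repeats the same computation component by component.

```python
import numpy as np

rng = np.random.default_rng(5)
I1, I2, I3, R = 4, 5, 6, 2
A = rng.standard_normal((I1, R))     # first state mode
B = rng.standard_normal((I2, R))     # second state mode
C = rng.standard_normal((I3, R))     # time mode
lam = np.array([2.0, 0.5])           # component weights

# M[i, j, k] = sum_r lam[r] * A[i, r] * B[j, r] * C[k, r]
M = np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)

# The same tensor built one rank-one component at a time
M_loop = np.zeros((I1, I2, I3))
for r in range(R):
    M_loop += lam[r] * np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
```

Flattening the last two modes of `M` gives a matrix of rank R for generic factors, which reflects the low-rank structure that the CP decomposition is designed to recover.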

To illustrate the tensor decomposition, we use the MATLAB N-way toolbox
developed by Bro and co-workers [116], which is available on the MathWorks
File Exchange. This simple-to-use package provides a variety of tools
to extract tensor decompositions and evaluate the factor models generated. In
the specific example considered here, we generate data from a spatio-temporal
function (see Fig. 1.30):

$$F(x,y,t) = \exp(-x^{2} - 0.5y^{2})\cos(2t) + \operatorname{sech}(x)\tanh(x)\exp(-0.2y^{2})\sin(t). \tag{1.58}$$

This model has two spatial modes with two distinct temporal frequencies, thus
a two-factor model should be sufficient to extract the underlying spatial and
temporal modes. To construct this function, Code 1.13 is used.


Figure 1.30: Example N-way array data set created from the function (1.58).
The data matrix is $\mathbf{A} \in \mathbb{R}^{121\times 101\times 315}$. A CP tensor decomposition can be used to
extract the two underlying structures that produced the data.

Code 1.13: [MATLAB] Creating tensor data.

x=-5:0.1:5; y=-6:0.1:6; t=0:0.1:10*pi;
[X,Y,T]=meshgrid(x,y,t);
A=exp(-(X.^2+0.5*Y.^2)).*(cos(2*T))+ ...
    (sech(X).*tanh(X).*exp(-0.2*Y.^2)).*sin(T);


Code 1.13: [Python] Creating tensor data.


x = np.arange(-5,5.01,0.1)
y = np.arange(-6,6.01,0.1)
t = np.arange(0,10*np.pi+0.1,0.1)
X,Y,T = np.meshgrid(x,y,t)
A = np.exp(-(X**2 + 0.5*Y**2)) * np.cos(2*T) + \
    (np.divide(np.ones_like(X),np.cosh(X)) * np.tanh(X) * np.exp(-0.2*Y**2)) * np.sin(T)


Figure 1.31: Three-way tensor decomposition of the function (1.58) discretized
so that the data matrix is $\mathbf{A} \in \mathbb{R}^{121\times 101\times 315}$. A CP tensor decomposition can be
used to extract the two underlying structures that produced the data. The first
factor is in blue, and the second factor is in red. The three distinct directions of
the data (parallel factors) are illustrated in (a) the y direction, (b) the x direction,
and (c) the time t.


Note that the meshgrid command is capable of generating N-way arrays. 
Indeed, MATLAB and Python have no difficulties specifying higher-dimensional 
arrays and tensors. Specifically, one can easily generate N-way data matrices 
with arbitrary dimensions. The MATLAB command A = randn(10, 10, 10, 10, 10) 
generates a five-way hypercube with random values in each of the five direc- 
tions of the array. 

Figure 1.30 shows eight snapshots of the function (1.58) discretized with the
code above. The N-way array data generated from the MATLAB code produces
$\mathbf{A} \in \mathbb{R}^{121\times 101\times 315}$, which is of total dimension $10^{6}$. The CP tensor decomposition
can be used to extract a two-factor model for this three-way array, thus
producing two vectors in each direction of space x, space y, and time t.

The N-way toolbox provides a simple architecture for performing tensor
decompositions. The PARAFAC command structure can easily take the input
function (1.58), which is discretized in the code above, and provide a two-factor
model. Code 1.14 produces the tensor model.


Code 1.14: [MATLAB] Two-factor tensor model.


model=parafac(A,2);
[A1,A2,A3]=fac2let(model);


Code 1.14: [Python] Two-factor tensor model.


from tensorly.decomposition import parafac
# Assumes a recent tensorly version, where parafac returns (weights, factors)
weights, (A1, A2, A3) = parafac(A,2)


Note that in the above MATLAB code, the fac2let command turns the factors 
in the model into their component matrices. Further note that the meshgrid 
arrangement of the data is different from parafac since the x and y directions 
are switched. 

Figure 1.31 shows the results of the N-way tensor decomposition for the
prescribed two-factor model. Specifically, the two vectors along each of the
three directions of the array are illustrated. For this example, the exact answer is 
known since the data was constructed from the rank-two model (1.58). The first 
set of two modes (along the original y direction) are Gaussian as prescribed. The 
second set of two modes (along the original x direction) include a Gaussian for 
the first function, and the antisymmetric sech(x) tanh(x) for the second func- 
tion. The third set of two modes correspond to the time dynamics of the two 
functions: cos(2t) and sin(t), respectively. Thus, the two-factor model produced 
by the CP tensor decomposition returns the expected, low-rank functions that 
produced the high-dimensional data matrix A. 

Recent theoretical and computational advances in N-way decompositions 
are opening up the potential for tensor decompositions in many fields. For N 
large, such decompositions can be computationally intractable due to the size 
of the data. Indeed, even in the simple example illustrated in Figs. 1.30 and 1.31,
there are roughly $10^6$ data points. Ultimately, the CP tensor decomposition does not
scale well with additional data dimensions. However, randomized techniques 
are helping yield tractable computations even for large data sets [214,234]. As 
with the SVD, randomized methods exploit the underlying low-rank structure 
of the data in order to produce an accurate approximation through the sum 
of rank-one outer products. Additionally, tensor decompositions can be com- 
bined with constraints on the form of the parallel factors in order to produce 
more easily interpretable results [464]. This gives a framework for producing 
interpretable and scalable computations of N-way data arrays. 

Homework


Exercise 1-1. Load the image dog.jpg and compute the full SVD. Choose a rank 
r < m and confirm that the matrix U*U is the r x r identity matrix. Now confirm
that UU* is not the identity matrix. Compute the norm of the error between
UU* and the n x n identity matrix as the rank r varies from 1 to n and plot the 
error. 


Exercise 1-2. Load the image dog.jpg and compute the economy SVD. Com- 
pute the relative reconstruction error of the truncated SVD in the Frobenius 
norm as a function of the rank r. Square this error to compute the fraction of 
missing variance as a function of r. You may also decide to plot 1 minus the 
error or missing variance to visualize the amount of norm or variance captured 
at a given rank r. Plot these quantities along with the cumulative sum of singu- 
lar values as a function of r. Find the rank r where the reconstruction captures 
99% of the total variance. Compare this with the rank r where the reconstruc- 
tion captures 99% in the Frobenius norm and with the rank r that captures 99% 
of the cumulative sum of singular values. 


Exercise 1-3. Load the Yale B image database and compute the economy SVD 
using a standard svd command. Now compute the SVD with the method of 
snapshots. Compare the singular value spectra on a log plot. Compare the first 
10 left singular vectors using each method (remember to reshape them into the 
shape of a face). Now compare a few singular vectors farther down the spec- 
trum. Explain your findings. 


Exercise 1-4. Generate a random 100 x 100 matrix, i.e., a matrix whose entries are 
sampled from a normal distribution. Compute the SVD of this matrix and plot 
the singular values. Repeat this 100 times and plot the distribution of singular 
values in a box-and-whisker plot. Plot the mean and median singular values as 
a function of r. Now repeat this for different matrix sizes (e.g., 50 x 50, 200 x 200, 
500 x 500, 1000 x 1000, etc.). 


Exercise 1-5. Compare the singular value distributions for a 1000 x 1000 uni- 
formly distributed random matrix and a Gaussian random matrix of the same 
size. Adapt the Gavish—Donoho algorithm to filter uniform noise based on this 
singular value distribution. Add uniform noise to a data set (either an image 
or the test low-rank signal) and apply this thresholding algorithm to filter the 
noise. Vary the magnitude of the noise and compare the results. Is the filtering 
good or bad? 


Exercise 1-6. This exercise will test the concept of condition number. We will 
test the accuracy of solving Ax = b when noise is added to b for matrices A
with different condition numbers. 


(a) To build the two matrices, generate a random $\mathbf{U} \in \mathbb{R}^{100\times 100}$ and
$\mathbf{V} \in \mathbb{R}^{100\times 100}$ and then create two $\boldsymbol{\Sigma}$ matrices: the first $\boldsymbol{\Sigma}$ will have singular
values spaced logarithmically from 100 to 1, and the second $\boldsymbol{\Sigma}$ will have
singular values spaced logarithmically from 100 to $10^{-6}$. Use these matrices
to create two A matrices, one with a condition number of 100 and
the other with a condition number of 100 million. Now create a random b
vector, solve for x with each of the two matrices, and compare the results. Add
a small $\boldsymbol{\epsilon}$ to b, with norm $10^{-6}$ smaller than the norm of b. Now solve for
x using this new $\mathbf{b} + \boldsymbol{\epsilon}$ and compare the results.


(b) Now repeat the experiment above with many different noise vectors $\boldsymbol{\epsilon}$
and compute the distribution of the error; plot this error as a histogram
and explain the shape.


(c) Repeat the above experiment comparing two A matrices with different
singular value distributions: the first $\boldsymbol{\Sigma}$ will have values spaced linearly
from 100 to 1 and the second $\boldsymbol{\Sigma}$ will have values spaced logarithmically
from 100 to 1. Does anything change? Please explain why or why not.


(d) Repeat the above experiment, but now with an A matrix that has size
100 x 10. Explain any changes.


Exercise 1-7. Load the data set for fluid flow past a cylinder (you can either 
download this from our book website, http://DMDbook.com, or generate it using the
IBPM code on GitHub). Each column is a flow field that has been reshaped into 
a vector. 


(a) Compute the SVD of this data set and plot the singular value spectrum
and the leading singular vectors. The U matrix contains eigenflow fields
and the $\boldsymbol{\Sigma}\mathbf{V}^*$ represents the amplitudes of these eigenflows as the flow
evolves in time.


(b) Write a code to plot the reconstructed movie for various truncation values
r. Compute the r value needed to capture 90%, 99%, and 99.9% of the flow
energy based on the singular value spectrum (recall that energy is given
by the Frobenius norm squared). Plot the movies for each of these truncation
values and compare the fidelity. Also compute the squared Frobenius
norm of the error between the true matrix $\mathbf{X}$ and the reconstructed matrix
$\tilde{\mathbf{X}}$, where $\mathbf{X}$ is the flow field movie.


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 


(c) Fix a value r = 10 and compute the truncated SVD. Each column $\mathbf{w}_k \in \mathbb{R}^{10}$
of the matrix $\mathbf{W} = \tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*$ represents the mixture of the first 10 eigenflows
in the kth column of $\mathbf{X}$. Verify this by comparing the kth snapshot of $\mathbf{X}$
with $\tilde{\mathbf{U}}\mathbf{w}_k$.


(d) Now, build a linear regression model for how the amplitudes $\mathbf{w}_k$ evolve
in time. This will be a dynamical system:

$$\mathbf{w}_{k+1} = \mathbf{A}\mathbf{w}_k.$$

Create a matrix $\mathbf{W}$ with the first 1 through m - 1 columns of $\boldsymbol{\Sigma}\mathbf{V}^*$ and
another matrix $\mathbf{W}'$ with the 2 through m columns of $\boldsymbol{\Sigma}\mathbf{V}^*$. We will now
try to solve for a best-fit A matrix so that

$$\mathbf{W}' \approx \mathbf{A}\mathbf{W}.$$

Compute the SVD of $\mathbf{W}$ and use this to compute the pseudo-inverse of
$\mathbf{W}$ to solve for A. Compute the eigenvalues of A and plot them in the
complex plane.


(e) Use this A matrix to advance the state $\mathbf{w}_k = \mathbf{A}^{k-1}\mathbf{w}_1$ starting from $\mathbf{w}_1$.
Plot the reconstructed flow field using these predicted amplitude vectors
and compare with the true values.


This exercise derived the dynamic mode decomposition from Section 7.2.


Fourier and Wavelet Transforms


A central concern of mathematical physics and engineering mathematics in- 
volves the transformation of equations into a coordinate system where expres- 
sions simplify, decouple, and are amenable to computation and analysis. This 
is a common theme throughout this book, in a wide variety of domains, in- 
cluding data analysis (e.g., the singular value decomposition, SVD), dynamical 
systems (e.g., spectral decomposition into eigenvalues and eigenvectors), and 
control (e.g., defining coordinate systems by controllability and observability). 
Perhaps the most foundational and ubiquitous coordinate transformation was 
introduced by J.-B. Joseph Fourier in the early 1800s to investigate the theory 
of heat [249]. Fourier introduced the concept that sine and cosine functions of 
increasing frequency provide an orthogonal basis for the space of solution func- 
tions. Indeed, the Fourier transform basis of sines and cosines are eigenfunc- 
tions of the heat equation, with the specific frequencies serving as the eigenval- 
ues, determined by the geometry, and amplitudes determined by the boundary 
conditions. 

Fourier’s seminal work provided the mathematical foundation for Hilbert 
spaces, operator theory, approximation theory, and the subsequent revolution 
in analytical and computational mathematics. Fast forward 200 years, and the 
fast Fourier transform has become the cornerstone of computational mathemat- 
ics, enabling real-time image and audio compression, global communication 
networks, modern devices and hardware, numerical physics and engineering 
at scale, and advanced data analysis. Simply put, the fast Fourier transform 
has had a more significant and profound role in shaping the modern world 
than any other algorithm to date. 

With increasingly complex problems, data sets, and computational geome- 
tries, simple Fourier sine and cosine bases have given way to tailored bases, 
such as the data-driven SVD. In fact, the SVD basis can be used as a direct ana- 
logue of the Fourier basis for solving partial differential equations (PDEs) with 
complex geometries, as will be discussed later. In addition, related functions, 
called wavelets, have been developed for advanced signal processing and compression
efforts. In this chapter, we will demonstrate a few of the many uses of
Fourier and wavelet transforms.


Figure 2.1: Discretized functions used to illustrate the inner product.


2.1 Fourier Series and Fourier Transforms 


Before describing the computational implementation of Fourier transforms on 
vectors of data, here we introduce the analytic Fourier series and Fourier trans- 
form, defined for continuous functions. Naturally, the discrete and continu- 
ous formulations should match in the limit of data with infinitely fine reso- 
lution. The Fourier series and transform are intimately related to the geometry 
of infinite-dimensional function spaces, or Hilbert spaces, which generalize the 
notion of vector spaces to include functions with infinitely many degrees of 
freedom. Thus, we begin with an introduction to function spaces. 


Inner Products of Functions and Vectors 


In this section, we will make use of inner products and norms of functions. In 
particular, we will use the common Hermitian inner product for functions f(x) 
and g(x) defined for x on a domain $x \in [a, b]$:

$$\langle f(x), g(x)\rangle = \int_a^b f(x)\,\bar{g}(x)\,dx, \tag{2.1}$$

where $\bar{g}$ denotes the complex conjugate.

The inner product of functions may seem strange or unmotivated at first, 
but this definition becomes clear when we consider the inner product of vectors 
of data. In particular, if we discretize the functions f(x) and g(x) into vectors 
of data, as in Fig. 2.1, we would like the vector inner product to converge to
the function inner product as the sampling resolution is increased. The inner
product of the data vectors $\mathbf{f} = [f_1\ f_2\ \cdots\ f_n]^T$ and $\mathbf{g} = [g_1\ g_2\ \cdots\ g_n]^T$ is
defined by

$$\langle \mathbf{f}, \mathbf{g}\rangle = \mathbf{g}^*\mathbf{f} = \sum_{k=1}^{n} f_k\,\bar{g}_k = \sum_{k=1}^{n} f(x_k)\,\bar{g}(x_k). \tag{2.2}$$

The magnitude of this inner product will grow as more data points are added;
i.e., as n increases. Thus, we may normalize by $\Delta x = (b - a)/(n - 1)$, where
$b = x_n$ and $a = x_1$:

$$\frac{b-a}{n-1}\,\langle \mathbf{f}, \mathbf{g}\rangle = \sum_{k=1}^{n} f(x_k)\,\bar{g}(x_k)\,\Delta x, \tag{2.3}$$

which is the Riemann approximation to the continuous function inner product
in (2.1). It is now clear that as we take the limit of $n \to \infty$ (i.e., infinite data resolution,
with $\Delta x \to 0$), the vector inner product converges to the inner product
of functions in (2.1).
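The convergence of the normalized vector inner product (2.3) to the function inner product (2.1) can be checked numerically. The sketch below is illustrative only: the test functions f(x) = x and g(x) = x^2 on [0, 1] (exact inner product 1/4) and the helper name `riemann_inner_product` are assumptions chosen for this example, not taken from the text.

```python
import numpy as np

def riemann_inner_product(f, g, a, b, n):
    """Approximate <f, g> on [a, b] from n samples, as in (2.3)."""
    x = np.linspace(a, b, n)          # x_1 = a, x_n = b
    dx = (b - a) / (n - 1)            # sample spacing
    # np.vdot(g, f) computes sum_k conj(g_k) * f_k, i.e. g^* f in (2.2)
    return np.vdot(g(x), f(x)) * dx

exact = 0.25  # integral of x * x^2 over [0, 1]
errors = []
for n in [11, 101, 1001, 10001]:
    approx = riemann_inner_product(lambda x: x, lambda x: x**2, 0.0, 1.0, n)
    errors.append(abs(approx - exact))
    print(f"n = {n:6d}   <f, g> ~ {approx:.6f}   error = {errors[-1]:.2e}")
```

The error should shrink roughly in proportion to the spacing, as expected for a simple Riemann sum.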


This inner product also induces a norm on functions, given by

$$\|f\|_2 = \left(\langle f, f\rangle\right)^{1/2} = \sqrt{\langle f, f\rangle} = \left(\int_a^b f(x)\,\bar{f}(x)\,dx\right)^{1/2}. \tag{2.4}$$

The set of all functions with bounded norm defines the set of square integrable
functions, denoted by $L^2([a, b])$; this is also known as the set of Lebesgue integrable
functions. The interval [a, b] may also be chosen to be infinite (e.g.,
$(-\infty, \infty)$), semi-infinite (e.g., $[a, \infty)$), or periodic (e.g., $[-\pi, \pi)$). A fun example
of a function in $L^2([1, \infty))$ is f(x) = 1/x. The square of f has finite integral from
1 to $\infty$, although the integral of the function itself diverges. The shape obtained
by rotating this function about the x-axis is known as Gabriel's horn, as the
volume is finite (related to the integral of $f^2$), while the surface area is infinite
(related to the integral of f).

As in finite-dimensional vector spaces, the inner product may be used to
project a function into a new coordinate system defined by a basis of orthogonal
functions. A Fourier series representation of a function f is precisely a
projection of this function onto the orthogonal set of sine and cosine functions 
with integer period on the domain [a,b]. This is the subject of the following 
sections. 


Fourier Series 


A fundamental result in Fourier analysis is that if f(x) is periodic and piecewise 
smooth, then it can be written in terms of a Fourier series, which is an infinite 
sum of cosines and sines of increasing frequency. In particular, if f(x) is $2\pi$-periodic,
it may be written as

$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty}\left(a_k\cos(kx) + b_k\sin(kx)\right). \tag{2.5}$$

The coefficients $a_k$ and $b_k$ are given by

$$a_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(kx)\,dx, \tag{2.6a}$$

$$b_k = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(kx)\,dx, \tag{2.6b}$$

which may be viewed as the coordinates obtained by projecting the function
onto the orthogonal cosine and sine basis $\{\cos(kx), \sin(kx)\}_{k=0}^{\infty}$. In other words,
the integrals in (2.6) may be rewritten in terms of the inner product as

$$a_k = \frac{1}{\|\cos(kx)\|^2}\,\langle f(x), \cos(kx)\rangle, \tag{2.7a}$$

$$b_k = \frac{1}{\|\sin(kx)\|^2}\,\langle f(x), \sin(kx)\rangle, \tag{2.7b}$$

where $\|\cos(kx)\|^2 = \|\sin(kx)\|^2 = \pi$. This factor of $1/\pi$ is easy to verify by numerically
integrating $\cos(x)^2$ and $\sin(x)^2$ from $-\pi$ to $\pi$.
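The numerical check suggested above takes only a few lines. The following sketch assumes a uniform periodic grid and a made-up test signal with known coefficients (a_2 = 3, b_5 = 1); the helper `inner` is a hypothetical name for the Riemann-sum inner product, not something defined in the text.

```python
import numpy as np

# Fine grid on [-pi, pi); the endpoint is dropped so that the periodic
# Riemann sum does not count the same point twice.
n = 4096
x = np.linspace(-np.pi, np.pi, n, endpoint=False)
dx = 2 * np.pi / n

def inner(f, g):
    """Riemann-sum approximation of the inner product (2.1) for real data."""
    return np.sum(f * g) * dx

norm_cos_sq = inner(np.cos(x), np.cos(x))   # should be close to pi
norm_sin_sq = inner(np.sin(x), np.sin(x))   # should be close to pi

# Hypothetical test signal with known coefficients a_2 = 3 and b_5 = 1
f = 3 * np.cos(2 * x) + np.sin(5 * x)
a2 = inner(f, np.cos(2 * x)) / inner(np.cos(2 * x), np.cos(2 * x))   # (2.7a)
b5 = inner(f, np.sin(5 * x)) / inner(np.sin(5 * x), np.sin(5 * x))   # (2.7b)
print(norm_cos_sq, norm_sin_sq, a2, b5)
```

Projecting onto cos(2x) and sin(5x) recovers the coefficients that were used to build the signal, which is the sense in which (2.7) reads off coordinates in an orthogonal basis.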
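The Fourier series for an L-periodic function on $[0, L)$ is similarly given by

$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty}\left(a_k\cos\!\left(\frac{2\pi kx}{L}\right) + b_k\sin\!\left(\frac{2\pi kx}{L}\right)\right), \tag{2.8}$$

with coefficients $a_k$ and $b_k$ given by

$$a_k = \frac{2}{L}\int_0^L f(x)\cos\!\left(\frac{2\pi kx}{L}\right)dx, \tag{2.9a}$$

$$b_k = \frac{2}{L}\int_0^L f(x)\sin\!\left(\frac{2\pi kx}{L}\right)dx. \tag{2.9b}$$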


Because we are expanding functions in terms of sine and cosine functions, it
is also natural to use Euler's formula $e^{ikx} = \cos(kx) + i\sin(kx)$ to write a Fourier
series in complex form with complex coefficients $c_k = \alpha_k + i\beta_k$:

$$
\begin{aligned}
f(x) &= \sum_{k=-\infty}^{\infty} c_k e^{ikx} = \sum_{k=-\infty}^{\infty} (\alpha_k + i\beta_k)\left(\cos(kx) + i\sin(kx)\right) \\
&= (\alpha_0 + i\beta_0) + \sum_{k=1}^{\infty}\left[(\alpha_{-k} + \alpha_k)\cos(kx) + (\beta_{-k} - \beta_k)\sin(kx)\right] \\
&\quad + i\sum_{k=1}^{\infty}\left[(\beta_{-k} + \beta_k)\cos(kx) - (\alpha_{-k} - \alpha_k)\sin(kx)\right].
\end{aligned}
\tag{2.10}
$$

If f(x) is real-valued, then $\alpha_{-k} = \alpha_k$ and $\beta_{-k} = -\beta_k$, so that $c_{-k} = \bar{c}_k$.


Thus, the functions $\psi_k = e^{ikx}$ for $k \in \mathbb{Z}$ (i.e., for integer k) provide a basis
for periodic, complex-valued functions on an interval $[0, 2\pi)$. It is simple to see
that these functions are orthogonal:

$$\langle \psi_j, \psi_k\rangle = \int_{-\pi}^{\pi} e^{ijx}e^{-ikx}\,dx = \int_{-\pi}^{\pi} e^{i(j-k)x}\,dx = \left[\frac{e^{i(j-k)x}}{i(j-k)}\right]_{-\pi}^{\pi} = \begin{cases} 0 & \text{if } j \neq k, \\ 2\pi & \text{if } j = k. \end{cases}$$

So $\langle \psi_j, \psi_k\rangle = 2\pi\delta_{jk}$, where $\delta$ is the Kronecker delta function. Similarly, the functions
$e^{i2\pi kx/L}$ provide a basis for $L^2([0, L))$, the space of square integrable functions
defined on $x \in [0, L)$.

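In principle, a Fourier series is just a change of coordinates of a function
f(x) into an infinite-dimensional orthogonal function space spanned by sines
and cosines (i.e., $\psi_k = e^{ikx} = \cos(kx) + i\sin(kx)$):

$$f(x) = \sum_{k=-\infty}^{\infty} c_k\psi_k(x) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}\langle f(x), \psi_k(x)\rangle\,\psi_k(x). \tag{2.11}$$

The coefficients are given by $c_k = (1/(2\pi))\langle f(x), \psi_k(x)\rangle$. The factor of $1/(2\pi)$
normalizes the projection by the square of the norm of $\psi_k$, i.e., $\|\psi_k\|^2 = 2\pi$. This
is consistent with our standard finite-dimensional notion of change of basis, as
in Fig. 2.2. A vector $\mathbf{f}$ may be written in the (x, y) or (u, v) coordinate systems,
via projection onto these orthogonal bases:

$$\mathbf{f} = \langle \mathbf{f}, \mathbf{x}\rangle\frac{\mathbf{x}}{\|\mathbf{x}\|^2} + \langle \mathbf{f}, \mathbf{y}\rangle\frac{\mathbf{y}}{\|\mathbf{y}\|^2} \tag{2.12a}$$

$$\phantom{\mathbf{f}} = \langle \mathbf{f}, \mathbf{u}\rangle\frac{\mathbf{u}}{\|\mathbf{u}\|^2} + \langle \mathbf{f}, \mathbf{v}\rangle\frac{\mathbf{v}}{\|\mathbf{v}\|^2}. \tag{2.12b}$$


Figure 2.2: Change of coordinates of a vector in two dimensions.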


Example: Fourier Series for a Continuous Hat Function 


As a simple example, we demonstrate the use of Fourier series to approximate
a continuous hat function, defined from $-\pi$ to $\pi$:

$$f(x) = \begin{cases} 0 & \text{for } x \in [-\pi, -\pi/2), \\ 1 + 2x/\pi & \text{for } x \in [-\pi/2, 0), \\ 1 - 2x/\pi & \text{for } x \in [0, \pi/2), \\ 0 & \text{for } x \in [\pi/2, \pi). \end{cases} \tag{2.13}$$

Because this function is even, it may be approximated with cosines alone. The
Fourier series for f(x) is shown in Fig. 2.3 for an increasing number of cosines.

Figure 2.4 shows the coefficients $a_k$ of the even cosine functions, along with
the approximation error, for an increasing number of modes. The error decreases
monotonically, as expected. The coefficients $b_k$ corresponding to the odd
sine functions are not shown, as they are identically zero since the hat function
is even.


Code 2.1: [MATLAB] Fourier series approximation to a hat function.


% Define domain
dx = 0.001;
L = pi;
x = (-1+dx:dx:1)*L;
n = length(x); nquart = floor(n/4);

% Define hat function
f = 0*x;
f(nquart:2*nquart) = 4*(1:nquart+1)/n;
f(2*nquart+1:3*nquart) = 1-4*(0:nquart-1)/n;
plot(x,f,'-k','LineWidth',1.5), hold on

% Compute Fourier series (coefficients normalized by 1/L, cf. (2.6))
CC = jet(20);
A0 = sum(f.*ones(size(x)))*dx/L;
fFS = A0/2;
for k=1:20
    A(k) = sum(f.*cos(pi*k*x/L))*dx/L; % Inner product
    B(k) = sum(f.*sin(pi*k*x/L))*dx/L;
    fFS = fFS + A(k)*cos(k*pi*x/L) + B(k)*sin(k*pi*x/L);
    plot(x,fFS,'-','Color',CC(k,:),'LineWidth',1.2)
end


Figure 2.3: (top) Hat function and Fourier cosine series approximation for
n = 7. (middle) Fourier cosines used to approximate the hat function. (bottom)
Zoom in of modes with small amplitude and high frequency.

Code 2.1: [Python] Fourier series approximation to a hat function.


import numpy as np
import matplotlib.pyplot as plt

# Define domain
dx = 0.001
L = np.pi
x = L * np.arange(-1+dx, 1+dx, dx)
n = len(x)
nquart = int(np.floor(n/4))

# Define hat function
f = np.zeros_like(x)
f[nquart:2*nquart] = (4/n)*np.arange(1, nquart+1)
f[2*nquart:3*nquart] = np.ones(nquart) - (4/n)*np.arange(0, nquart)

fig, ax = plt.subplots()
ax.plot(x, f, '-', color='k')

# Compute Fourier series (coefficients normalized by 1/L, cf. (2.6))
A0 = np.sum(f * np.ones_like(x)) * dx / L
fFS = A0/2

A = np.zeros(20)
B = np.zeros(20)
for k in range(20):
    A[k] = np.sum(f * np.cos(np.pi*(k+1)*x/L)) * dx / L  # Inner product
    B[k] = np.sum(f * np.sin(np.pi*(k+1)*x/L)) * dx / L
    fFS = fFS + A[k]*np.cos((k+1)*np.pi*x/L) + B[k]*np.sin((k+1)*np.pi*x/L)
    ax.plot(x, fFS, '-')


Figure 2.4: Fourier coefficients (top) and relative error of Fourier cosine approximation
with true function (bottom) for the hat function in Fig. 2.3. The n = 7
approximation is highlighted with a blue circle.


Figure 2.5: Gibbs phenomenon is characterized by high-frequency oscillations
near discontinuities. The black curve is discontinuous, and the red curve is the
Fourier approximation.


Example: Fourier Series for a Discontinuous Hat Function 


We now consider the discontinuous square hat function, defined on $[0, L)$, shown
in Fig. 2.5. The function is given by:

$$f(x) = \begin{cases} 0 & \text{for } x \in [0, L/4), \\ 1 & \text{for } x \in [L/4, 3L/4), \\ 0 & \text{for } x \in [3L/4, L). \end{cases} \tag{2.14}$$


The truncated Fourier series is plagued by ringing oscillations, known as the 
Gibbs phenomenon, around the sharp corners of the step function. This ex- 
ample highlights the challenge of applying the Fourier series to discontinu- 
ous functions. The code to reproduce this example is available on the book’s 
GitHub. 
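As a rough illustration (this is not the book's GitHub code), the sketch below builds the truncated series (2.8) for the square hat function (2.14) with numerically computed coefficients (2.9) and measures how far the approximation overshoots the value 1. The domain length L = 1, the grid size, and the function name `truncated_series` are assumptions made for this example.

```python
import numpy as np

L = 1.0                                   # assumed domain length
n = 4096
x = np.linspace(0, L, n, endpoint=False)
dx = L / n
f = np.where((x >= L / 4) & (x < 3 * L / 4), 1.0, 0.0)   # square hat (2.14)

def truncated_series(f, x, L, dx, m):
    """Sum of the first m cosine/sine pairs of the series (2.8)."""
    a0 = (2 / L) * np.sum(f) * dx
    fFS = np.full_like(x, a0 / 2)
    for k in range(1, m + 1):
        c = np.cos(2 * np.pi * k * x / L)
        s = np.sin(2 * np.pi * k * x / L)
        ak = (2 / L) * np.sum(f * c) * dx     # coefficient (2.9a)
        bk = (2 / L) * np.sum(f * s) * dx     # coefficient (2.9b)
        fFS += ak * c + bk * s
    return fFS

overshoot = {}
for m in [10, 50, 200]:
    fFS = truncated_series(f, x, L, dx, m)
    overshoot[m] = fFS.max() - 1.0
    print(f"m = {m:3d} modes: max overshoot = {overshoot[m]:.4f}")
```

Adding modes narrows the ringing region around each jump, but the peak overshoot does not decay toward zero; it stays near 9% of the jump height, which is the signature of the Gibbs phenomenon.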


Fourier Transform 


The Fourier series is defined for periodic functions, so that, outside the domain
of definition, the function repeats itself forever. The Fourier transform integral
is essentially the limit of a Fourier series as the length of the domain goes to
infinity, which allows us to define a function defined on $(-\infty, \infty)$ without repeating,
as shown in Fig. 2.6. We will consider the Fourier series on a domain
$x \in [-L, L)$, and then let $L \to \infty$. On this domain, the Fourier series is

$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty}\left[a_k\cos\!\left(\frac{k\pi x}{L}\right) + b_k\sin\!\left(\frac{k\pi x}{L}\right)\right] = \sum_{k=-\infty}^{\infty} c_k e^{ik\pi x/L}, \tag{2.15}$$

with the coefficients given by

$$c_k = \frac{1}{2L}\langle f(x), \psi_k\rangle = \frac{1}{2L}\int_{-L}^{L} f(x)\,e^{-ik\pi x/L}\,dx. \tag{2.16}$$


Figure 2.6: (top) Fourier series is only valid for a function that is periodic on
the domain $[-L, L)$. (bottom) The Fourier transform is valid for generic non-periodic
functions.


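Restating the previous results, f(x) is now represented by a sum of sines and
cosines with a discrete set of frequencies given by $\omega_k = k\pi/L$. Taking the limit
as $L \to \infty$, these discrete frequencies become a continuous range of frequencies.
Define $\omega = k\pi/L$, $\Delta\omega = \pi/L$, and take the limit $L \to \infty$, so that $\Delta\omega \to 0$:

$$f(x) = \lim_{\Delta\omega\to 0}\sum_{k=-\infty}^{\infty}\frac{\Delta\omega}{2\pi}\underbrace{\int_{-\pi/\Delta\omega}^{\pi/\Delta\omega} f(\xi)\,e^{-ik\Delta\omega\xi}\,d\xi}_{\langle f(x),\,\psi_k(x)\rangle}\,e^{ik\Delta\omega x}. \tag{2.17}$$

When we take the limit, the expression $\langle f(x), \psi_k(x)\rangle$ will become the Fourier
transform of f(x), denoted by $\hat{f}(\omega) \triangleq \mathcal{F}(f(x))$. In addition, the summation
with weight $\Delta\omega$ becomes a Riemann integral, resulting in the following:

$$\hat{f}(\omega) = \mathcal{F}(f(x)) = \int_{-\infty}^{\infty} f(x)\,e^{-i\omega x}\,dx, \tag{2.18a}$$

$$f(x) = \mathcal{F}^{-1}(\hat{f}(\omega)) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\hat{f}(\omega)\,e^{i\omega x}\,d\omega. \tag{2.18b}$$

These two integrals are known as the Fourier transform pair. Both integrals converge
as long as $\int_{-\infty}^{\infty}|f(x)|\,dx < \infty$ and $\int_{-\infty}^{\infty}|\hat{f}(\omega)|\,d\omega < \infty$; i.e., as long as
both functions belong to the space of Lebesgue integrable functions, $f, \hat{f} \in L^1(-\infty, \infty)$.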

The Fourier transform is particularly useful because of a number of proper- 
ties, including linearity, and how derivatives of functions behave in the Fourier 
transform domain. These properties have been used extensively for data anal- 
ysis and scientific computing (e.g., to solve PDEs accurately and efficiently), as 
will be explored throughout this chapter. 


Derivatives of Functions 


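The Fourier transform of the derivative of a function is given by

$$
\begin{aligned}
\mathcal{F}\!\left(\frac{d}{dx}f(x)\right) &= \int_{-\infty}^{\infty} f'(x)\,e^{-i\omega x}\,dx && \text{(2.19a)} \\
&= \left[f(x)\,e^{-i\omega x}\right]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} f(x)\left[-i\omega e^{-i\omega x}\right]dx && \text{(2.19b)} \\
&= i\omega\int_{-\infty}^{\infty} f(x)\,e^{-i\omega x}\,dx && \text{(2.19c)} \\
&= i\omega\,\mathcal{F}(f(x)). && \text{(2.19d)}
\end{aligned}
$$

The formula for the Fourier transform of a higher derivative is given by

$$\mathcal{F}\!\left(\frac{d^n}{dx^n}f(x)\right) = i^n\omega^n\,\mathcal{F}(f(x)). \tag{2.20}$$

This is an extremely important property of the Fourier transform, as it will allow
us to turn PDEs into ordinary differential equations (ODEs), closely related
to the separation of variables:

$$\underbrace{u_{tt} = c\,u_{xx}}_{\text{(PDE)}} \quad \overset{\mathcal{F}}{\Longrightarrow} \quad \underbrace{\hat{u}_{tt} = -c\,\omega^2\hat{u}}_{\text{(ODE)}}. \tag{2.21}$$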



Linearity of Fourier Transforms 


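The Fourier transform is a linear operator, so that

$$\mathcal{F}(\alpha f(x) + \beta g(x)) = \alpha\mathcal{F}(f) + \beta\mathcal{F}(g) \tag{2.22}$$

and

$$\mathcal{F}^{-1}(\alpha\hat{f}(\omega) + \beta\hat{g}(\omega)) = \alpha\mathcal{F}^{-1}(\hat{f}) + \beta\mathcal{F}^{-1}(\hat{g}). \tag{2.23}$$


Parseval's Theorem


$$\int_{-\infty}^{\infty}|\hat{f}(\omega)|^2\,d\omega = 2\pi\int_{-\infty}^{\infty}|f(x)|^2\,dx. \tag{2.24}$$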


In other words, the Fourier transform preserves the L2-norm, up to a constant. 
This is closely related to unitarity, so that two functions will retain the same 
inner product before and after the Fourier transform. This property is useful 
for approximation and truncation, providing the ability to bound error at a 
given truncation. 
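A discrete counterpart of (2.24) can be checked with an FFT. In the sketch below, which assumes NumPy's unnormalized FFT convention, the constant $2\pi$ is replaced by the vector length n. The random test vector and the choice of keeping r = 100 coefficients are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
f = rng.standard_normal(n)                  # assumed test data
fhat = np.fft.fft(f)

energy_x = np.sum(np.abs(f) ** 2)
energy_w = np.sum(np.abs(fhat) ** 2)
print(energy_w / energy_x)                  # equals n up to round-off

# Truncation: keep the r largest-magnitude Fourier coefficients
r = 100
idx = np.argsort(np.abs(fhat))[::-1]        # indices sorted by decreasing magnitude
fhat_trunc = np.zeros_like(fhat)
fhat_trunc[idx[:r]] = fhat[idx[:r]]
f_trunc = np.fft.ifft(fhat_trunc)

err_x = np.sum(np.abs(f - f_trunc) ** 2)            # error measured in x
err_w = np.sum(np.abs(fhat[idx[r:]]) ** 2) / n      # energy of discarded coefficients
print(err_x, err_w)
```

The reconstruction error equals the energy of the discarded coefficients (up to the factor n), so the truncation error can be read off in the frequency domain without inverting the transform.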


Convolution 


The convolution of two functions is particularly well behaved in the Fourier 
domain, being the product of the two Fourier-transformed functions. We define 
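the convolution of two functions f(x) and g(x) as $f * g$:

$$(f * g)(x) = \int_{-\infty}^{\infty} f(x - \xi)\,g(\xi)\,d\xi. \tag{2.25}$$

If we let $\hat{f} = \mathcal{F}(f)$ and $\hat{g} = \mathcal{F}(g)$, then

$$
\begin{aligned}
\mathcal{F}^{-1}(\hat{f}\hat{g})(x) &= \frac{1}{2\pi}\int_{-\infty}^{\infty}\hat{f}(\omega)\,\hat{g}(\omega)\,e^{i\omega x}\,d\omega && \text{(2.26a)} \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\hat{f}(\omega)\,e^{i\omega x}\left(\int_{-\infty}^{\infty} g(y)\,e^{-i\omega y}\,dy\right)d\omega && \text{(2.26b)} \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(y)\,\hat{f}(\omega)\,e^{i\omega(x-y)}\,d\omega\,dy && \text{(2.26c)} \\
&= \int_{-\infty}^{\infty} g(y)\underbrace{\left(\frac{1}{2\pi}\int_{-\infty}^{\infty}\hat{f}(\omega)\,e^{i\omega(x-y)}\,d\omega\right)}_{f(x-y)}dy && \text{(2.26d)} \\
&= \int_{-\infty}^{\infty} g(y)\,f(x-y)\,dy = g * f = f * g. && \text{(2.26e)}
\end{aligned}
$$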


Thus, multiplying functions in the frequency domain is the same as convolv- 
ing functions in the spatial domain. This will be particularly useful for control 
systems and transfer functions with the related Laplace transform. 
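For sampled, periodic data the same statement holds with circular convolution in place of the integral (2.25). The following sketch compares a direct circular convolution against the product of FFTs; the random test vectors and the length n = 64 are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# Direct circular convolution: (f * g)[k] = sum_j f[(k - j) mod n] g[j]
conv_direct = np.array(
    [sum(f[(k - j) % n] * g[j] for j in range(n)) for k in range(n)]
)

# Same result via the frequency domain: multiply the transforms, then invert
conv_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))

print(np.max(np.abs(conv_direct - conv_fft)))
```

The direct sum costs O(n^2) operations, whereas the frequency-domain route needs only three transforms and one elementwise product.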

Figure 2.7: Discrete data sampled for the discrete Fourier transform.


2.2 Discrete Fourier Transform (DFT) and Fast Fourier 
Transform (FFT) 


Until now, we have considered the Fourier series and Fourier transform for con- 
tinuous functions f(x). However, when computing or working with real data, 
it is necessary to approximate the Fourier transform on discrete vectors of data. 
The resulting discrete Fourier transform (DFT) is essentially a discretized version
of the Fourier series for vectors of data $\mathbf{f} = [f_1\ f_2\ f_3\ \cdots\ f_n]^T$ obtained
by discretizing the function f(x) at a regular spacing, $\Delta x$, as in Fig. 2.7.

The DFT is tremendously useful for numerical approximation and computation,
but it does not scale well to very large $n \gg 1$, as the simple formulation
involves multiplication by a dense n x n matrix, requiring $O(n^2)$ operations.
In 1965, James W. Cooley (IBM) and John W. Tukey (Princeton) developed the 
revolutionary fast Fourier transform (FFT) algorithm that scales as 
O(n log(n)). As n becomes very large, the log(n) component grows slowly, and 
the algorithm approaches a linear scaling. Their algorithm was based on a frac- 
tal symmetry in the Fourier transform that allows an n-dimensional DFT to 
be solved with a number of lower-dimensional DFT computations. Although 
the different computational scaling between the DFT and FFT implementations 
may seem like a small difference, the fast O(n log(n)) scaling is what enables 
the ubiquitous use of the FFT in real-time communication, based on audio and 
image compression [731]. 

It is important to note that Cooley and Tukey did not invent the idea of the 
FFT, as there were decades of prior work developing special cases, although 
they provided the general formulation that is currently used. Amazingly, the 
FFT algorithm was formulated by Gauss over 150 years earlier in 1805 to ap- 
proximate the orbits of the asteroids Pallas and Juno from measurement data, 
as he required a highly accurate interpolation scheme [319]. As the computa- 
tions were performed by Gauss in his head and on paper, he required a fast 
algorithm, and developed the FFT. However, Gauss did not view this as a major
breakthrough and his formulation only appeared later in 1866 in his compiled
notes [265]. It is interesting to note that Gauss's discovery even pre-dates
Fourier's announcement of the Fourier series expansion in 1807, which was
later published in 1822 [248].


Discrete Fourier Transform 


Although we will always use the FFT for computations, it is illustrative to begin 
with the simplest formulation of the DFT. The discrete Fourier transform is 
given by 


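$$\hat{f}_k = \sum_{j=0}^{n-1} f_j\,e^{-i2\pi jk/n}, \tag{2.27}$$

and the inverse discrete Fourier transform (iDFT) is given by

$$f_k = \frac{1}{n}\sum_{j=0}^{n-1}\hat{f}_j\,e^{i2\pi jk/n}. \tag{2.28}$$

Thus, the DFT is a linear operator (i.e., a matrix) that maps the data points in $\mathbf{f}$
to the frequency domain $\hat{\mathbf{f}}$:

$$\{f_1, f_2, \ldots, f_n\} \quad \overset{\text{DFT}}{\Longrightarrow} \quad \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_n\}. \tag{2.29}$$

For a given number of points n, the DFT represents the data using sine and
cosine functions with integer multiples of a fundamental frequency, $\omega_n = e^{-2\pi i/n}$.
The DFT may be computed by matrix multiplication:

$$
\begin{bmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \hat{f}_3 \\ \vdots \\ \hat{f}_n \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega_n & \omega_n^2 & \cdots & \omega_n^{n-1} \\
1 & \omega_n^2 & \omega_n^4 & \cdots & \omega_n^{2(n-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)^2}
\end{bmatrix}
\begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_n \end{bmatrix}.
\tag{2.30}
$$

The output vector $\hat{\mathbf{f}}$ contains the Fourier coefficients for the input vector $\mathbf{f}$, and
the DFT matrix $\mathbf{F}$ is a Vandermonde matrix that is unitary up to a scaling by $1/\sqrt{n}$. The matrix $\mathbf{F}$ is complex-valued,
so the output $\hat{\mathbf{f}}$ has both a magnitude and a phase, which will both have
useful physical interpretations.

The real part of the DFT matrix $\mathbf{F}$ is shown in Fig. 2.8 for n = 256. Code 2.2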
generates and plots this matrix. It can be seen from this image that there is 
a hierarchical and highly symmetric multi-scale structure to F. Each row and 
column is a cosine function with increasing frequency. 

Figure 2.8: Real part of the DFT matrix for n = 256.


Code 2.2: [MATLAB] Generate discrete Fourier transform matrix.


n = 256;
w = exp(-i*2*pi/n);

% Slow
for i=1:n
    for j=1:n
        DFT(i,j) = w^((i-1)*(j-1));
    end
end

% Fast
[I,J] = meshgrid(1:n,1:n);
DFT = w.^((I-1).*(J-1));
imagesc(real(DFT))


Code 2.2: [Python] Generate discrete Fourier transform matrix.


import numpy as np

n = 256
w = np.exp(-1j * 2 * np.pi / n)

# Slow
DFT = np.zeros((n, n), dtype=complex)  # preallocate the complex matrix
for i in range(n):
    for k in range(n):
        DFT[i,k] = w**(i*k)

DFT = np.real(DFT)

# Fast
J,K = np.meshgrid(np.arange(n),np.arange(n))
DFT = np.power(w,J*K)
DFT = np.real(DFT)
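A matrix built as in Code 2.2 can be checked against a library FFT. The sketch below is an illustrative check, not part of the original listing: it uses a small size n = 8 and a random test vector (both assumptions), keeps the complex matrix rather than its real part, and also confirms that F times its conjugate transpose equals n times the identity.

```python
import numpy as np

n = 8
w = np.exp(-2j * np.pi / n)
J, K = np.meshgrid(np.arange(n), np.arange(n))
F = w ** (J * K)                      # complex DFT matrix, as in (2.30)

f = np.random.default_rng(2).standard_normal(n)
fhat_matrix = F @ f                   # O(n^2) matrix-vector product
fhat_fft = np.fft.fft(f)              # O(n log n) library FFT

gram = F @ F.conj().T                 # equals n * I, so F / sqrt(n) is unitary
print(np.max(np.abs(fhat_matrix - fhat_fft)))
```

Both routes give the same coefficients; only the cost of computing them differs.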


Fast Fourier Transform 


As mentioned earlier, multiplying by the DFT matrix $\mathbf{F}$ involves $O(n^2)$ operations.
The fast Fourier transform scales as $O(n\log(n))$, enabling a tremendous
range of applications, including audio and image compression in MP3 and JPG
formats, streaming video, satellite communications, and the cellular network,
to name only a few of the myriad applications. For example, audio is generally
sampled at 44.1 kHz, or 44100 samples per second. For 10 s of audio, the
vector $\mathbf{f}$ will have dimension $n = 4.41 \times 10^5$. Computing the DFT using matrix
multiplication involves approximately $2 \times 10^{11}$, or 200 billion, multiplications.
In contrast, the FFT requires approximately $6 \times 10^6$, which amounts to a speed-up
factor of over 30 000. Thus, the FFT has become synonymous with the DFT,
and FFT libraries are built in to nearly every device and operating system that 
performs digital signal processing. 
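The operation counts quoted above follow from simple arithmetic. The short sketch below reproduces them while ignoring constant factors; it assumes the logarithm in the n log(n) estimate is the natural logarithm, which matches the quoted figure of roughly 6 x 10^6 (a base-2 logarithm would give a somewhat larger count).

```python
import math

n = 44100 * 10                 # 10 s of audio sampled at 44.1 kHz
dft_ops = n ** 2               # dense matrix-vector product
fft_ops = n * math.log(n)      # n log(n), natural logarithm assumed
speedup = dft_ops / fft_ops

print(f"n       = {n:.3e}")
print(f"DFT ops = {dft_ops:.3e}")
print(f"FFT ops = {fft_ops:.3e}")
print(f"speedup = {speedup:.0f}")
```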

To see the tremendous benefit of the FFT, consider the transmission, stor- 
age, and decoding of an audio signal. We will see later that many signals are 
highly compressible in the Fourier transform domain, meaning that most of 
the coefficients of f are small and can be discarded. This enables much more 
efficient storage and transmission of the compressed signal, as only the non- 
zero Fourier coefficients must be transmitted. However, it is then necessary to 
rapidly encode and decode the compressed Fourier signal by computing the 
FFT and inverse FFT (iFFT). This is accomplished with the one-line MATLAB 
commands 


>> fhat = fft(f);     % Fast Fourier transform
>> f = ifft(fhat);    % Inverse fast Fourier transform


and Python commands


>>> fhat = np.fft.fft(f)     # Fast Fourier transform
>>> f = np.fft.ifft(fhat)    # Inverse fast Fourier transform

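The basic idea behind the FFT is that the DFT may be implemented much
more efficiently if the number of data points n is a power of 2. For example,
consider $n = 1024 = 2^{10}$. In this case, the DFT matrix $\mathbf{F}_{1024}$ may be written as

$$\hat{\mathbf{f}} = \mathbf{F}_{1024}\mathbf{f} = \begin{bmatrix} \mathbf{I}_{512} & \mathbf{D}_{512} \\ \mathbf{I}_{512} & -\mathbf{D}_{512} \end{bmatrix}\begin{bmatrix} \mathbf{F}_{512} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{512} \end{bmatrix}\begin{bmatrix} \mathbf{f}_{\text{even}} \\ \mathbf{f}_{\text{odd}} \end{bmatrix}, \tag{2.31}$$

where $\mathbf{f}_{\text{even}}$ are the even index elements of $\mathbf{f}$, $\mathbf{f}_{\text{odd}}$ are the odd index elements of
$\mathbf{f}$, $\mathbf{I}_{512}$ is the 512 x 512 identity matrix, and $\mathbf{D}_{512}$ is given by

$$\mathbf{D}_{512} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & \omega & 0 & \cdots & 0 \\ 0 & 0 & \omega^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \omega^{511} \end{bmatrix}. \tag{2.32}$$

This expression can be derived from a careful accounting and reorganization of
the terms in (2.27) and (2.30). If $n = 2^p$, this process can be repeated, and $\mathbf{F}_{512}$
can be represented by $\mathbf{F}_{256}$, which can then be represented by $\mathbf{F}_{128} \to \mathbf{F}_{64} \to \mathbf{F}_{32} \to \cdots$.
If $n \neq 2^p$, the vector can be padded with zeros until it is a power of 2.
The FFT then involves an efficient interleaving of even and odd indices of sub-vectors
of $\mathbf{f}$, and the computation of several smaller 2 x 2 DFT computations.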
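The even/odd splitting described above can be written as a short recursive routine. The sketch below is a minimal, unoptimized illustration rather than how production FFT libraries are implemented; the function name `fft_radix2` is hypothetical, indices are zero-based (so "even" means f[0], f[2], ...), and the diagonal entries are powers of $\omega = e^{-2\pi i/n}$ for the current length n.

```python
import numpy as np

def fft_radix2(f):
    """Recursive radix-2 FFT for len(f) a power of 2 (illustrative, not optimized)."""
    f = np.asarray(f, dtype=complex)
    n = f.size
    if n == 1:
        return f
    if n % 2:
        raise ValueError("length must be a power of 2")
    even = fft_radix2(f[0::2])     # F_{n/2} f_even
    odd = fft_radix2(f[1::2])      # F_{n/2} f_odd
    d = np.exp(-2j * np.pi * np.arange(n // 2) / n)   # diagonal of D_{n/2}
    # [I  D; I  -D] applied to the stacked half-size transforms, as in (2.31)
    return np.concatenate([even + d * odd, even - d * odd])

rng = np.random.default_rng(3)
f = rng.standard_normal(1024)      # n = 2^10, as in the example above
fhat = fft_radix2(f)
print(np.max(np.abs(fhat - np.fft.fft(f))))
```

Each level of the recursion does O(n) work and there are log2(n) levels, which is where the O(n log(n)) scaling comes from.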


FFT Example: Noise Filtering 


To gain familiarity with how to use and interpret the FFT, we will begin with 
a simple example that uses the FFT to de-noise a signal. We will consider a 
function of time f(t): 


$$f(t) = \sin(2\pi f_1 t) + \sin(2\pi f_2 t), \tag{2.33}$$

with frequencies $f_1 = 50$ and $f_2 = 120$. We then add a large amount of Gaussian
white noise to this signal, as shown in the top panel of Fig. 2.9.

It is possible to compute the fast Fourier transform of this noisy signal
using the fft command. The power spectral density (PSD) is the normalized
squared magnitude of $\hat{\mathbf{f}}$, and indicates how much power the signal contains in
each frequency. In Fig. 2.9 (middle), it is clear that the noisy signal contains
two large peaks at 50 Hz and 120 Hz. It is possible to zero-out components that
have power below a threshold to remove noise from the signal. After inverse
transforming the filtered signal, we find the clean and filtered time series match
quite well (Fig. 2.9, bottom). Code 2.3 performs each step and plots the results.


Code 2.3: [MATLAB] Fast Fourier transform to de-noise signal. 


%% Create a simple signal with two frequencies
dt = .001;
t = 0:dt:1;
f = sin(2*pi*50*t) + sin(2*pi*120*t); % Sum of 2 frequencies
f = f + 2.5*randn(size(t));           % Add some noise

%% Compute the Fast Fourier Transform FFT
n = length(t);
fhat = fft(f,n);               % Compute the fast Fourier transform
PSD = fhat.*conj(fhat)/n;      % Power spectrum (power per freq)
freq = 1/(dt*n)*(0:n-1);       % Create x-axis of frequencies in Hz
L = 1:floor(n/2);              % Only plot the first half of freqs

%% Use the PSD to filter out noise
indices = PSD>100;             % Find all freqs with large power
PSDclean = PSD.*indices;       % Zero out all others
fhat = indices.*fhat;          % Zero out small Fourier coeffs. in Y
ffilt = ifft(fhat);            % Inverse FFT for filtered time signal

Figure 2.9: De-noising with FFT. (top) Noise is added to a simple signal given by
a sum of two sine waves. (middle) In the Fourier domain, dominant peaks may
be selected and the noise filtered. (bottom) The de-noised signal is obtained by
inverse Fourier transforming the two dominant peaks.


Code 2.3: [Python] Fast Fourier transform to de-noise signal. 


import numpy as np

# Create a simple signal with two frequencies
dt = 0.001
t = np.arange(0,1,dt)
f = np.sin(2*np.pi*50*t) + np.sin(2*np.pi*120*t) # Sum of 2 frequencies
f_clean = f
f = f + 2.5*np.random.randn(len(t))              # Add some noise

## Compute the Fast Fourier Transform (FFT)
n = len(t)
fhat = np.fft.fft(f,n)                     # Compute the FFT
PSD = np.real(fhat * np.conj(fhat)) / n    # Power spectrum (power per freq), kept real
freq = (1/(dt*n)) * np.arange(n)           # Create x-axis of frequencies in Hz
L = np.arange(1,np.floor(n/2),dtype='int') # Only plot the first half of freqs

## Use the PSD to filter out noise
indices = PSD > 100       # Find all freqs with large power
PSDclean = PSD * indices  # Zero out all others
fhat = indices * fhat     # Zero out small Fourier coeffs. in Y
ffilt = np.fft.ifft(fhat) # Inverse FFT for filtered time signal


FFT Example: Spectral Derivatives 


For the next example, we will demonstrate the use of the FFT for the fast and
accurate computation of derivatives. As we saw in (2.19), the continuous Fourier
transform has the property that $\mathcal{F}(df/dx) = i\omega\mathcal{F}(f)$. Similarly, the numerical
derivative of a vector of discretized data can be well approximated by multiplying
each component of the discrete Fourier transform of the vector $\hat{\mathbf{f}}$ by $i\kappa$,
where $\kappa = 2\pi k/n$ is the discrete wavenumber associated with that component.
The accuracy and efficiency of the spectral derivative make it particularly useful
for solving partial differential equations, as explored in the next section.


To demonstrate this so-called spectral derivative, we will start with a function
$f(x)$ where we can compute the analytic derivative for comparison:

$$f(x) = \cos(x)e^{-x^2/25} \implies \frac{df}{dx}(x) = -\sin(x)e^{-x^2/25} - \frac{2}{25}x f(x). \tag{2.34}$$

Figure 2.10 compares the spectral derivative with the analytic derivative and
the forward Euler finite-difference derivative using n = 128 discretization points:

$$\frac{df}{dx}(x_k) \approx \frac{f(x_{k+1}) - f(x_k)}{\Delta x}. \tag{2.35}$$

The error of both differentiation schemes may be reduced by increasing n,
which is the same as decreasing $\Delta x$. However, the error of the spectral derivative
improves more rapidly with increasing n than finite-difference schemes,
as shown in Fig. The forward Euler differentiation is notoriously inaccurate,
with error of order $O(\Delta x)$; however, even increasing the order of
a finite-difference scheme will not yield the same accuracy trend as the spectral
derivative, which is effectively using information on the whole domain.
Code 2.4 computes and compares the two differentiation schemes.

Figure 2.10: Comparison of the spectral derivative, computed using the FFT,
with the finite-difference derivative.


Code 2.4: [MATLAB] Fast Fourier transform to compute derivatives. 
n = 128;                    % Number of discretization points (as in the text)
L = 30;                     % Length of domain (assumed value)
dx = L/n;
x = -L/2:dx:L/2-dx;
f = cos(x).*exp(-x.^2/25);                    % Function
df = -(sin(x).*exp(-x.^2/25) + (2/25)*x.*f);  % Derivative

%% Derivative using FFT (spectral derivative)
fhat = fft(f);
kappa = (2*pi/L)*[-n/2:n/2-1];
kappa = fftshift(kappa);    % Re-order fft frequencies
dfhat = i*kappa.*fhat;
dfFFT = real(ifft(dfhat));

Figure 2.12: Gibbs phenomenon for the spectral derivative of a function with
discontinuous derivative.


Code 2.4: [Python] Fast Fourier transform to compute derivatives. 


## Derivative using FFT (spectral derivative)
fhat = np.fft.fft(f)
kappa = (2*np.pi/L)*np.arange(-n/2,n/2)
kappa = np.fft.fftshift(kappa) # Re-order fft frequencies
dfhat = kappa * fhat * (1j)
dfFFT = np.real(np.fft.ifft(dfhat))


If the derivative of a function is discontinuous, then the spectral derivative
will exhibit the Gibbs phenomenon, as shown in Fig. 2.12.
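To make the comparison between the two schemes concrete, the following self-contained sketch evaluates both derivatives of the test function in (2.34) and reports their maximum errors. The domain length `L = 30` and the helper name `derivative_errors` are assumptions made for illustration; `np.fft.fftfreq` is used in place of the explicit `fftshift` reordering, which yields the same wavenumbers in FFT order.

```python
import numpy as np

def derivative_errors(n, L=30.0):
    """Max errors of forward-difference and spectral derivatives on n points."""
    dx = L / n
    x = -L/2 + dx * np.arange(n)
    f = np.cos(x) * np.exp(-x**2 / 25)
    df_exact = -(np.sin(x) * np.exp(-x**2 / 25) + (2/25) * x * f)

    # Forward Euler finite difference (2.35), wrapping periodically at the end
    df_fd = (np.roll(f, -1) - f) / dx

    # Spectral derivative: multiply each Fourier coefficient by i*kappa
    kappa = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    df_fft = np.real(np.fft.ifft(1j * kappa * np.fft.fft(f)))

    return np.max(np.abs(df_fd - df_exact)), np.max(np.abs(df_fft - df_exact))

err_fd, err_fft = derivative_errors(128)
print(f"finite difference: {err_fd:.2e}, spectral: {err_fft:.2e}")
```

Because the FFT treats the data as periodic, the spectral error here is limited by the small mismatch of the derivative at the domain boundaries rather than by the grid spacing.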
2.3 Transforming Partial Differential Equations


The Fourier transform was originally formulated in the 1800s as a change of co- 
ordinates for the heat equation into an eigenfunction coordinate system where 
the dynamics decouple. More generally, the Fourier transform is useful for 
transforming partial differential equations (PDEs) into ordinary differential equa- 
tions (ODEs), as in (2.21). Here, we will demonstrate the utility of the FFT to 
numerically solve a number of PDEs. For an excellent treatment of spectral
methods for PDEs, see Trefethen; extensions also exist for stiff PDEs [377].


Heat Equation 


The Fourier transform basis is ideally suited to solve the heat equation. In one 
spatial dimension, the heat equation is given by 


$$u_t = \alpha^2 u_{xx}, \tag{2.36}$$


where $u(t, x)$ is the temperature distribution in time and space. If we Fourier-transform
in space, then $\mathcal{F}(u(t, x)) = \hat{u}(t, \omega)$. The PDE in (2.36) becomes

$$\hat{u}_t = -\alpha^2\omega^2\hat{u}, \tag{2.37}$$

since the two spatial derivatives contribute $(i\omega)^2 = -\omega^2$ in the Fourier transform
domain. Thus, by taking the Fourier transform, the PDE in (2.36) becomes
an ODE for each fixed frequency $\omega$. The solution is given by

$$\hat{u}(t,\omega) = e^{-\alpha^2\omega^2 t}\,\hat{u}(0,\omega). \tag{2.38}$$


The function $\hat{u}(0,\omega)$ is the Fourier transform of the initial temperature
distribution $u(0, x)$. It is now clear that higher frequencies, corresponding to larger
values of $\omega$, decay more rapidly as time evolves, so that sharp corners in the
temperature distribution rapidly smooth out. We may take the inverse Fourier
transform using the convolution property in (2.25), yielding

$$u(t,x) = \mathcal{F}^{-1}(\hat{u}(t,\omega)) = \mathcal{F}^{-1}\left(e^{-\alpha^2\omega^2 t}\right) * u(0,x) = \frac{1}{2\alpha\sqrt{\pi t}}\,e^{-x^2/(4\alpha^2 t)} * u(0,x). \tag{2.39}$$


To simulate this PDE numerically, it is simpler and more accurate to first
transform to the frequency domain using the FFT. In this case (2.37) becomes

$$\hat{u}_t = -\alpha^2\kappa^2\hat{u}, \tag{2.40}$$

where $\kappa$ is the discretized frequency. It is important to use the fftshift command
to reorder the wavenumbers according to the MATLAB convention.

Figure 2.13: Solution of the 1D heat equation in time for an initial condition
given by a square hat function. As time evolves, the sharp corners rapidly
smooth and the solution approaches a Gaussian function.

Figure 2.14: Evolution of the 1D heat equation in time, illustrated by a waterfall
plot (left) and an x-t diagram (right).
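Because (2.38) gives the solution of each Fourier mode in closed form, the discretized heat equation can also be advanced to any time without a time-stepper. The sketch below illustrates this under assumed parameter values matching the square-hat setup (`alpha = 1`, `L = 100`, `N = 1000`); `heat_exact` is a hypothetical helper name, and `np.fft.fftfreq` supplies the wavenumbers already in FFT order.

```python
import numpy as np

alpha = 1.0                     # Thermal diffusivity constant (assumed)
L, N = 100.0, 1000              # Domain length and number of grid points (assumed)
dx = L / N
x = -L/2 + dx * np.arange(N)
kappa = 2 * np.pi * np.fft.fftfreq(N, d=dx)   # Discrete wavenumbers, FFT order

u0 = np.zeros(N)
u0[np.abs(x) <= L/10] = 1.0     # Square hat initial condition

def heat_exact(u_init, t):
    """Advance u_init by time t: scale each mode by exp(-alpha^2 kappa^2 t)."""
    uhat0 = np.fft.fft(u_init)
    return np.real(np.fft.ifft(np.exp(-alpha**2 * kappa**2 * t) * uhat0))

u5 = heat_exact(u0, 5.0)
print(u0.mean(), u5.mean())     # The mean (kappa = 0 mode) does not decay
```

The zero wavenumber is untouched by the exponential factor, so the mean temperature is conserved while every other mode decays at a rate set by $\kappa^2$.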


Code 2.5 simulates the one-dimensional (1D) heat equation using the FFT,
as shown in Figs. 2.13 and 2.14. In this example, because the PDE is linear, it is
possible to advance the system using ode45 directly in the frequency domain,
using the vector field given in Code 2.6.

Figures 2.13 and 2.14 show several different views of the temperature distribution
$u(t, x)$ as it evolves in time. Figure 2.13 shows the distribution at several
times overlaid, and this same data is visualized in Fig. 2.14 in a waterfall plot
(left) and in an x-t diagram (right). In all of the figures, it becomes clear that
the sharp corners diffuse rapidly, as these correspond to the highest wavenumbers.
Eventually, the lowest wavenumber variations will also decay, until the
temperature reaches a constant steady-state distribution, which is a solution of
Laplace's equation $u_{xx} = 0$. When solving this PDE using the FFT, we are implicitly
assuming that the solution domain is periodic, so that the right and left
boundaries are identified and the domain forms a ring. However, if the domain
is large enough, then the effect of the boundaries is small.


Code 2.5: [MATLAB] Code to simulate the 1D heat equation using the Fourier 
transform. 


a = 1;        % Thermal diffusivity constant
L = 100;      % Length of domain
N = 1000;     % Number of discretization points
dx = L/N;
x = -L/2:dx:L/2-dx;   % Define x domain

% Define discrete wavenumbers
kappa = (2*pi/L)*[-N/2:N/2-1];
kappa = fftshift(kappa);    % Re-order fft wavenumbers

% Initial condition
u0 = 0*x;
u0((L/2 - L/10)/dx:(L/2 + L/10)/dx) = 1;

% Simulate in Fourier frequency domain
t = 0:0.1:10;
[t,uhat]=ode45(@(t,uhat)rhsHeat(t,uhat,kappa,a),t,fft(u0));
for k = 1:length(t)   % iFFT to return to spatial domain
    u(k,:) = ifft(uhat(k,:));
end

% Plot solution in time
figure, waterfall((u(1:10:end,:)));
figure, imagesc(flipud(u));


Code 2.5: [Python] Code to simulate the 1D heat equation using the Fourier 
transform. 


import numpy as np
from scipy.integrate import odeint

a = 1    # Thermal diffusivity constant
L = 100  # Length of domain
N = 1000 # Number of discretization points
dx = L/N
x = np.arange(-L/2,L/2,dx) # Define x domain

# Define discrete wavenumbers
kappa = 2*np.pi*np.fft.fftfreq(N, d=dx)

# Initial condition
u0 = np.zeros_like(x)
u0[int((L/2 - L/10)/dx):int((L/2 + L/10)/dx)] = 1
u0hat = np.fft.fft(u0)

# odeint works with real vectors, so stack real and imaginary parts
u0hat_ri = np.concatenate((u0hat.real,u0hat.imag))

# Simulate in Fourier frequency domain (rhsHeat is defined in Code 2.6)
dt = 0.1
t = np.arange(0,10,dt)
uhat_ri = odeint(rhsHeat, u0hat_ri, t, args=(kappa,a))
uhat = uhat_ri[:,:N] + (1j) * uhat_ri[:,N:]

u = np.zeros_like(uhat)
for k in range(len(t)):
    u[k,:] = np.fft.ifft(uhat[k,:])
u = u.real


Code 2.6: [MATLAB] Right-hand side for 1D heat equation in Fourier domain,
$d\hat{u}/dt$.

function duhatdt = rhsHeat(t,uhat,kappa,a)
duhatdt = -a^2*(kappa.^2)'.*uhat;  % Linear and diagonal


Code 2.6: [Python] Right-hand side for 1D heat equation in Fourier domain,
$d\hat{u}/dt$.

def rhsHeat(uhat_ri,t,kappa,a):
    # N (number of grid points) is taken from the enclosing script
    uhat = uhat_ri[:N] + (1j) * uhat_ri[N:]
    d_uhat = -a**2 * (np.power(kappa,2)) * uhat
    d_uhat_ri = np.concatenate((d_uhat.real,d_uhat.imag)).astype('float64')
    return d_uhat_ri


One-Way Wave Equation 


A second example is the simple linear PDE for the one-way wave equation:

$$u_t + c u_x = 0. \tag{2.41}$$

Any initial condition $u(0, x)$ will simply propagate to the right in time with
speed c, as $u(t, x) = u(0, x - ct)$ is a solution. The code to simulate this PDE is
nearly identical to the above code for the heat equation, and it is available on
the book's GitHub. In this example, we simulate this PDE for an initial condition
given by a Gaussian pulse. It is possible to integrate this equation in the
Fourier transform domain, as before, using the vector field given by Code 2.7.
However, it is also possible to integrate this equation in the spatial domain,
simply using the FFT to compute derivatives and then transform back. The
solution $u(t, x)$ is plotted in Figs. 2.15 and 2.16, as before.

Figure 2.15: Solution of the 1D wave equation in time. As time evolves, the
Gaussian initial condition moves from left to right at a constant wave speed.

Figure 2.16: Evolution of the 1D wave equation in time, illustrated by a waterfall
plot (left) and an x-t diagram (right).
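The traveling-wave solution $u(t, x) = u(0, x - ct)$ can be checked directly in the Fourier domain, where the one-way wave equation reduces to $\hat{u}_t = -ic\kappa\hat{u}$ and each mode simply rotates in phase. The following is a minimal sketch under assumed values for the wave speed, domain, and pulse width; it is not the code from the book's GitHub.

```python
import numpy as np

c = 2.0                         # Wave speed (assumed)
L, N = 100.0, 1000              # Domain length and grid size (assumed)
dx = L / N
x = -L/2 + dx * np.arange(N)
kappa = 2 * np.pi * np.fft.fftfreq(N, d=dx)

def gaussian(x):
    """Gaussian pulse used as the initial condition (width is an assumption)."""
    return np.exp(-x**2 / 8)

u0 = gaussian(x)
t = 5.0

# Exact solution of uhat_t = -i c kappa uhat: a phase shift of every mode
u_t = np.real(np.fft.ifft(np.exp(-1j * c * kappa * t) * np.fft.fft(u0)))

# Compare with the translated initial condition u(0, x - c t)
u_shifted = gaussian(x - c * t)
max_err = np.max(np.abs(u_t - u_shifted))
print(max_err)
```

Unlike the heat equation, the multiplier has unit magnitude for every wavenumber, so the pulse keeps its shape and only its position changes.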


Code 2.7: [MATLAB] Right-hand side for 1D wave equation in Fourier domain. 


function duhatdt = rhsWave (t,uhat, kappa, c) 
duhatdt = -c*i*kappa.*uhat;


Code 2.7: [Python] Right-hand side for 1D wave equation in Fourier domain.

def rhsWave(uhat_ri,t,kappa,c):
    uhat = uhat_ri[:N] + (1j) * uhat_ri[N:]
    d_uhat = -c*(1j)*kappa*uhat
    d_uhat_ri = np.concatenate((d_uhat.real,d_uhat.imag)).astype('float64')
    return d_uhat_ri

Figure 2.17: Solution of Burgers' equation in time. As time evolves, the leading
edge of the Gaussian initial condition steepens, forming a shock front.


Burgers’ Equation 


For the final example, we consider the nonlinear Burgers' equation,

$$u_t + u u_x = \nu u_{xx}, \tag{2.42}$$

which is a simple 1D example for the nonlinear convection and diffusion that
gives rise to shock waves in fluids [336]. The nonlinear convection $u u_x$ essentially
gives rise to the behavior of wave steepening, where portions of u with
larger amplitude will convect more rapidly, causing a shock front to form.

The code to simulate Burgers' equation is on the book's GitHub, giving rise
to Figs. 2.17 and 2.18. Burgers' equation is an interesting example to solve with
the FFT, because the nonlinearity requires us to map into and out of the Fourier
domain at each time-step, as shown in the vector field in Code 2.8. In this example,
we map into the Fourier transform domain to compute $u_x$ and $u_{xx}$, and
then map back to the spatial domain to compute the product $u u_x$. Figures 2.17
and 2.18 clearly show the wave steepening effect that gives rise to a shock.
Without the damping term $u_{xx}$, this shock would become infinitely steep, but
with damping, it maintains a finite width.


Code 2.8: [MATLAB] Right-hand side for Burgers’ equation in Fourier trans- 
form domain. 


function dudt = rhsBurgers(t,u,kappa,nu)
uhat = fft(u);
duhat = i*kappa.*uhat;
dduhat = -(kappa.^2).*uhat;
du = ifft(duhat);
ddu = ifft(dduhat);
dudt = -u.*du + nu*ddu;

Figure 2.18: Evolution of Burgers' equation in time, illustrated by a waterfall
plot (left) and an x-t diagram (right).


Code 2.8: [Python] Right-hand side for Burgers’ equation in Fourier transform 
domain. 


def rhsBurgers(u,t,kappa,nu):
    uhat = np.fft.fft(u)
    d_uhat = (1j)*kappa*uhat
    dd_uhat = -np.power(kappa,2)*uhat
    d_u = np.fft.ifft(d_uhat)
    dd_u = np.fft.ifft(dd_uhat)
    du_dt = -u * d_u + nu*dd_u
    return du_dt.real
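A driver for this right-hand side might look like the following sketch. It is an illustration only and not the code from the book's GitHub: a hand-written fourth-order Runge-Kutta step stands in for `odeint`, and the viscosity, domain, grid, time step, and initial pulse are all assumed values. The function `rhs_burgers` repeats the logic of Code 2.8 without the unused time argument.

```python
import numpy as np

def rhs_burgers(u, kappa, nu):
    """Pseudo-spectral right-hand side of u_t = -u u_x + nu u_xx."""
    uhat = np.fft.fft(u)
    du = np.real(np.fft.ifft(1j * kappa * uhat))      # u_x via the Fourier domain
    ddu = np.real(np.fft.ifft(-kappa**2 * uhat))      # u_xx via the Fourier domain
    return -u * du + nu * ddu                         # product formed in x-space

def rk4_step(u, dt, kappa, nu):
    """One classical Runge-Kutta step (a stand-in for a library integrator)."""
    k1 = rhs_burgers(u, kappa, nu)
    k2 = rhs_burgers(u + 0.5 * dt * k1, kappa, nu)
    k3 = rhs_burgers(u + 0.5 * dt * k2, kappa, nu)
    k4 = rhs_burgers(u + dt * k3, kappa, nu)
    return u + dt * (k1 + 2*k2 + 2*k3 + k4) / 6

nu = 0.1                        # Diffusion constant (assumed)
L, N = 20.0, 256                # Domain length and grid size (assumed)
dx = L / N
x = -L/2 + dx * np.arange(N)
kappa = 2 * np.pi * np.fft.fftfreq(N, d=dx)

u0 = 1.0 / np.cosh(x)           # Smooth initial pulse (assumed)
u = u0.copy()
dt = 0.005
for _ in range(400):            # Integrate to t = 2
    u = rk4_step(u, dt, kappa, nu)

steepest_slope = np.min(np.gradient(u, dx))   # Most negative slope on the front
```

Plotting `u` against `x` at several times should show the leading edge steepening while the diffusion term keeps the front at a finite width.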


2.4 Gabor Transform and the Spectrogram 


Although the Fourier transform provides detailed information about the frequency
content of a given signal, it does not give any information about when
in time those frequencies occur. The Fourier transform is only able to characterize
truly periodic and stationary signals, as time is stripped out via the integration
in (2.18). For a signal with non-stationary frequency content, such as
a musical composition, it is important to simultaneously characterize the frequency
content and its evolution in time.

Figure 2.19: Illustration of the Gabor transform with a translating Gaussian
window for the short-time Fourier transform.

The Gabor transform, also known as the short-time Fourier transform (STFT),
computes a windowed FFT in a moving window [648], as shown in
Fig. 2.19. This STFT enables the localization of frequency content in time, resulting
in the spectrogram, which is a plot of frequency versus time, as demonstrated
later in Figs. 2.21 and 2.22. The STFT is given by

$$\mathcal{G}(f)(t,\omega) = \hat{f}_g(t,\omega) = \int_{-\infty}^{\infty} f(\tau)e^{-i\omega\tau}\bar{g}(\tau - t)\,d\tau = \langle f, g_{t,\omega}\rangle, \tag{2.43}$$

where $g_{t,\omega}(\tau)$ is defined as

$$g_{t,\omega}(\tau) = e^{i\omega\tau}g(\tau - t). \tag{2.44}$$

The function $g(t)$ is the kernel, and is often chosen to be a Gaussian:

$$g(t) = e^{-(t-\tau)^2/a^2}. \tag{2.45}$$

The parameter a determines the spread of the short-time window for the Fourier
transform, and $\tau$ determines the center of the moving window.
The inverse STFT is given by

$$f(t) = \mathcal{G}^{-1}\left(\hat{f}_g(t,\omega)\right) = \frac{1}{2\pi\|g\|^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \hat{f}_g(\tau,\omega)\,g(t - \tau)\,e^{i\omega t}\,d\omega\,d\tau. \tag{2.46}$$

Figure 2.20: Power spectral density of quadratic chirp signal.

Discrete Gabor Transform 


Generally, the Gabor transform will be performed on discrete signals, as with
the FFT. In this case, it is necessary to discretize both time and frequency:

$$\nu = j\Delta\omega, \tag{2.47}$$
$$\tau = k\Delta t. \tag{2.48}$$

The discretized kernel function becomes

$$g_{j,k}(t) = e^{i2\pi j\Delta\omega t}g(t - k\Delta t), \tag{2.49}$$

and the discrete Gabor transform is

$$\hat{f}_{j,k} = \langle f, g_{j,k}\rangle = \int_{-\infty}^{\infty} f(\tau)\bar{g}_{j,k}(\tau)\,d\tau. \tag{2.50}$$

This integral can then be approximated using a finite Riemann sum on discretized
functions $f$ and $g_{j,k}$.
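The Riemann-sum approximation can be sketched with a Gaussian window and one FFT per window position. This is a minimal illustration, not an optimized STFT: `gabor_transform`, the window width `a = 0.05`, and the three window centers are assumptions, and the test signal is the quadratic chirp of (2.51) with the parameter values used in Code 2.9.

```python
import numpy as np

def gabor_transform(f, dt, a, t_centers):
    """Approximate the Gabor transform of f at the given window centers."""
    n = f.size
    t = dt * np.arange(n)
    freqs = np.fft.rfftfreq(n, d=dt)              # Frequencies in Hz
    G = np.empty((freqs.size, len(t_centers)), dtype=complex)
    for k, tau in enumerate(t_centers):
        g = np.exp(-(t - tau)**2 / a**2)          # Gaussian kernel centered at tau
        G[:, k] = dt * np.fft.rfft(f * g)         # Riemann sum over the window
    return freqs, G

dt = 0.001
t = np.arange(0, 2, dt)
f0, f1, t1 = 50, 250, 2
x = np.cos(2*np.pi*t*(f0 + (f1 - f0)*t**2/(3*t1**2)))   # Quadratic chirp

centers = np.array([0.25, 1.0, 1.75])             # Window centers in seconds
freqs, G = gabor_transform(x, dt, a=0.05, t_centers=centers)
peak_freqs = freqs[np.argmax(np.abs(G), axis=0)]  # Dominant frequency per window
print(peak_freqs)
```

Each column of `G` is one vertical slice of a spectrogram, so the dominant frequency should rise from window to window as the chirp sweeps upward.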


Example: Quadratic Chirp 
As a simple example, we construct an oscillating cosine function where the
frequency of oscillation increases as a quadratic function of time:

$$f(t) = \cos(2\pi t\,\omega(t)) \quad \text{where} \quad \omega(t) = \omega_0 + (\omega_1 - \omega_0)t^2/(3t_1^2). \tag{2.51}$$

The frequency shifts from $\omega_0$ at $t = 0$ to $\omega_1$ at $t = t_1$.

Figure 2.20 shows the power spectral density (PSD) obtained from the FFT
of the quadratic chirp signal. Although there is a clear peak at 50 Hz, there is
no information about the progression of the frequency in time. The code to
generate the spectrogram is given in Code 2.9, and the resulting spectrogram
is plotted in Fig. 2.21, where it can be seen that the frequency content shifts in
time.

Figure 2.21: Spectrogram of quadratic chirp signal. The PSD is shown on the
left, corresponding to the integrated power across rows of the spectrogram.

Code 2.9: [MATLAB] Spectrogram of quadratic chirp, shown in Fig. 2.21.

t = 0:0.001:2;
f0 = 50;
f1 = 250;
t1 = 2;
x = chirp(t,f0,t1,f1,'quadratic');
x = cos(2*pi*t.*(f0 + (f1-f0)*t.^2/(3*t1^2)));
% There is a typo in Matlab documentation...
% ... divide by 3 so derivative amplitude matches frequency
spectrogram(x,128,120,128,1e3,'yaxis')

Code 2.9: [Python] Spectrogram of quadratic chirp, shown in Fig. 2.21.

import numpy as np
import matplotlib.pyplot as plt

dt = 0.001
t = np.arange(0,2,dt)
f0 = 50
f1 = 250
t1 = 2
x = np.cos(2*np.pi*t*(f0 + (f1-f0)*np.power(t,2)/(3*t1**2)))
plt.specgram(x, NFFT=128, Fs=1/dt, noverlap=120,cmap='jet')


Example: Beethoven’s Sonata Pathétique 


It is possible to analyze richer signals with the spectrogram, such as Beethoven's
Sonata Pathétique, shown in Fig. 2.22. The spectrogram is widely used to analyze
music, and has recently been leveraged in the Shazam algorithm, which
searches for key point markers in the spectrogram of songs to enable rapid
classification from short clips of recorded music [740].

Figure 2.22 shows the first two bars of Beethoven's Sonata Pathétique, along
with the spectrogram. In the spectrogram, the various chords and harmonics
can be seen clearly. A zoom-in of the frequency shows two octaves, and how
cleanly the various notes are excited. Code 2.10 loads the data, computes the
spectrogram, and plots the result.


Code 2.10: [MATLAB] Compute spectrogram of Beethoven's Sonata Pathétique
(Fig. 2.22).

% Download mp3read from http://www.mathworks.com/matlabcentral/fileexchange/13852-mp3read-and-mp3write
[Y,FS,NBITS,OPTS] = mp3read('beethoven.mp3');

%% Spectrogram using 'spectrogram' command
T = 40;          % 40 seconds
y = Y(1:T*FS);   % First 40 seconds
spectrogram(y,5000,400,24000,24000,'yaxis');

%% Spectrogram using short-time Fourier transform 'stft'
wlen = 5000;     % Window length
h = 400;         % Overlap is wlen - h
[S,f,t_stft] = stft(y, wlen, h, FS/4, FS);  % y axis 0-4000 Hz
imagesc(log10(abs(S)));  % Plot spectrogram (log-scaled)

To invert the spectrogram and generate the original sound:

[x_istft, t_istft] = istft(S, h, FS/4, FS);
sound(x_istft,FS);


Artists, such as Aphex Twin, have used the inverse spectrogram of images to
generate music. The frequency of a given piano key is also easily computed.
For example, the 40th key frequency is given by

freq = @(n)(((2^(1/12))^(n-49))*440);
freq(40) % frequency of 40th key = C
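The same formula can be written in Python. This sketch assumes the standard 88-key numbering in which key 49 is the A above middle C, tuned to 440 Hz; `key_frequency` is a hypothetical helper name.

```python
def key_frequency(n):
    """Frequency in Hz of the n-th piano key, with key 49 tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((n - 49) / 12)

middle_c = key_frequency(40)   # 40th key = C, nine semitones below key 49
print(round(middle_c, 2))
```

Each step of one key multiplies the frequency by $2^{1/12}$, so twelve keys span exactly one octave.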


Uncertainty Principles 


In time-frequency analysis, there is a fundamental uncertainty principle that
limits the ability to simultaneously attain high resolution in both the time and
frequency domains. In the extreme limit, a time series is perfectly resolved in
time, but provides no information about frequency content, and the Fourier
transform perfectly resolves frequency content, but provides no information
about when in time these frequencies occur. The spectrogram resolves both
time and frequency information, but with lower resolution in each domain,
as illustrated in Fig. 2.23. An alternative approach, based on a multi-resolution
analysis, will be the subject of the next section.

Figure 2.22: First two bars of Beethoven's Sonata Pathétique (No. 8 in C Minor,
Op. 13), along with annotated spectrogram.

Figure 2.23: Illustration of resolution limitations and uncertainty in time-frequency
analysis.

Stated mathematically, the time-frequency uncertainty principle may
be written as

$$\left(\int_{-\infty}^{\infty} x^2|f(x)|^2\,dx\right)\left(\int_{-\infty}^{\infty} \omega^2|\hat{f}(\omega)|^2\,d\omega\right) \geq \frac{1}{16\pi^2}. \tag{2.52}$$

This is true if $f(x)$ is absolutely continuous and both $xf(x)$ and $f'(x)$ are square
integrable. The function $x^2|f(x)|^2$ is the dispersion about $x = 0$. For real-valued
functions, this is the second moment, which measures the variance if $f(x)$ is a
Gaussian function. In other words, a function $f(x)$ and its Fourier transform
cannot both be arbitrarily localized. If the function $f$ approaches a delta function,
then the Fourier transform must become broadband, and vice versa. This
has implications for the Heisenberg uncertainty principle [320], as the position
and momentum wave functions are Fourier transform pairs.

In time-frequency analysis, the uncertainty principle has implications for
the ability to localize the Fourier transform in time. These uncertainty principles
are known as the Gabor limit. As the frequency content of a signal is
resolved more finely, we lose information about when in time these events occur,
and vice versa. Thus, there is a fundamental tradeoff between the simultaneously
attainable resolutions in the time and frequency domains. Another
implication is that a function $f$ and its Fourier transform cannot both have finite
support, meaning that they are localized, as stated in Benedicks's theorem
[14, 72].


2.5 Laplace Transform 


The Laplace¹ transform is closely related to the Fourier transform, and it is used
extensively in differential equations and control theory. Like the Fourier transform,
the Laplace transform is used to transform PDEs into simpler ODEs, and
it is also useful for transforming ODEs into algebraic equations. Here we will
derive the Laplace transform as a generalized Fourier transform and demonstrate
some of its useful properties.

The Fourier transform is defined for well-behaved functions that decay sufficiently
rapidly to zero as the domain goes to infinity, i.e., for Lebesgue integrable
functions $f \in L^1[(-\infty, \infty)]$. However, many functions we are interested
in, such as exponential functions $e^{\lambda t}$, the well-named Heaviside function

$$H(t) = \begin{cases} 0 & \text{for } t \leq 0, \\ 1 & \text{for } t > 0, \end{cases} \tag{2.53}$$

and the trigonometric functions $\sin(t)$ and $\cos(t)$, do not satisfy this property;
see Fig. 2.24 for examples. It is technically possible to Fourier-transform some
of these functions, such as trigonometric functions, by multiplying by a window
function and then taking the limit as the window becomes infinitely large.
However, this approach does not translate to exponential functions, which are
unbounded at either $t \to -\infty$ or $t \to \infty$. This limitation rules out using the
Fourier transform to analyze a large class of ODEs and PDEs. The Laplace
transform is a Fourier-like transform that is valid for the larger class of functions
that are not Lebesgue integrable, including exponential functions.

We will consider the Laplace transform as a weighted, one-sided Fourier
transform for badly behaved functions. The solution to transforming a function
$f(t)$ that is unbounded as $t \to \infty$, such as $f(t) = e^{\lambda t}$, is to first multiply it by a
decaying exponential function $e^{-\gamma t}$, where $\gamma$ is chosen so that $e^{-\gamma t}$ decays faster
than $f(t)$ grows; this is the weighting. Although this solves the unboundedness of $f(t)$
as $t \to \infty$, now the function $e^{-\gamma t}$ is unbounded for $t \to -\infty$. Thus, we also
multiply by the Heaviside function $H(t)$, which forces the function to be zero
for $t < 0$; thus the transformation is one-sided. Our new weighted, one-sided
function $F(t)$ is given by

$$F(t) = f(t)e^{-\gamma t}H(t) = \begin{cases} 0 & \text{for } t \leq 0, \\ f(t)e^{-\gamma t} & \text{for } t > 0. \end{cases} \tag{2.54}$$

¹ Pierre-Simon Laplace was born the son of peasant farmers and is now immortalized on the
Eiffel tower. He was also an early data scientist, realizing that real-world measurement data is
noisy and imperfect, and must be viewed through the lens of probability theory.

Figure 2.24: The Fourier transform requires that functions are well behaved
at $\pm\infty$, as in the Gaussian function (a). Many functions are not well behaved
and are difficult or impossible to Fourier-transform, such as the exponential
function $e^{\lambda t}$ (b), the Heaviside function $H(t)$ (c), and the cosine function (d). It is
possible to Fourier-transform the cosine function by multiplying by a window
function and then extending the window size to infinity, but this does not work
for unstable functions, like the exponential.


Figure 2.25 shows this one-sided weighting procedure for a function that is
unbounded as $t \to \infty$. We now take the Fourier transform $\hat{F}(\omega)$ of $F(t)$, which
will be the Laplace transform $\bar{f}(s)$ of $f(t)$:

$$\hat{F}(\omega) = \mathcal{F}(F(t)) = \int_{-\infty}^{\infty} F(t)e^{-i\omega t}\,dt = \int_{0}^{\infty} f(t)e^{-\gamma t}e^{-i\omega t}\,dt \tag{2.55a}$$
$$= \int_{0}^{\infty} f(t)e^{-(\gamma + i\omega)t}\,dt = \int_{0}^{\infty} f(t)e^{-st}\,dt = \bar{f}(s), \tag{2.55b}$$

where we have introduced the Laplace variable $s = \gamma + i\omega$.

Figure 2.25: The Laplace transform is a weighted, one-sided Fourier transform
for badly behaved functions. Given an unstable function $f(t)$, it is possible to
multiply by the Heaviside function $H(t)$ and a sufficiently damped exponential
function $e^{-\gamma t}$, resulting in a function that may be Fourier-transformed.

To derive the inverse Laplace transform, we will begin with the inverse
Fourier transform of $\hat{F}(\omega)$:

$$F(t) = \mathcal{F}^{-1}(\hat{F}(\omega)) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{F}(\omega)e^{i\omega t}\,d\omega. \tag{2.56}$$

Multiplying both sides by $e^{\gamma t}$, we recover $f(t)H(t)$:

$$f(t)H(t) = e^{\gamma t}F(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{\gamma t}\hat{F}(\omega)e^{i\omega t}\,d\omega \tag{2.57a}$$
$$= \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{F}(\omega)e^{(\gamma + i\omega)t}\,d\omega. \tag{2.57b}$$

We may express the right-hand side in terms of the Laplace variable $s = \gamma + i\omega$
by noting that $ds = i\,d\omega \implies d\omega = (1/i)\,ds$ and changing the bounds of
integration from $-\infty$ to $\infty$ in $d\omega$ to $\gamma - i\infty$ to $\gamma + i\infty$ in $ds$:

$$f(t)H(t) = \frac{1}{2\pi i}\int_{\gamma - i\infty}^{\gamma + i\infty} \bar{f}(s)e^{st}\,ds. \tag{2.58}$$

This is the expression for the inverse Laplace transform of $\bar{f}(s)$. The coefficient
$1/i$ has been incorporated into the $1/(2\pi i)$ term in front of the integral, and $\hat{F}(\omega)$
has been replaced by $\bar{f}(s)$ from (2.55).

Therefore, the Laplace transform pair is given by

$$\bar{f}(s) = \mathcal{L}\{f(t)\} = \int_{0}^{\infty} f(t)e^{-st}\,dt, \tag{2.59a}$$
$$f(t) = \mathcal{L}^{-1}\{\bar{f}(s)\} = \frac{1}{2\pi i}\int_{\gamma - i\infty}^{\gamma + i\infty} \bar{f}(s)e^{st}\,ds. \tag{2.59b}$$

Note that in (2.59) we have dropped the Heaviside function. This is equivalent
to defining the Laplace transform as only being valid for functions $f(t)$ defined
on the semi-infinite domain $t > 0$.

To summarize, the Laplace transform is a generalized Fourier transform designed
to handle poorly behaved functions, such as exponentials. Even functions
that have a Fourier transform often have a simpler Laplace transform.
For example, the Dirac delta function, which requires an infinite number of
Fourier frequencies to represent, is simply the constant 1 in the Laplace domain;
this property makes the Laplace transform particularly useful for studying
impulse responses and systems with forcing. A number of properties of the
Fourier transform carry over to the Laplace transform (see Exercise 2-10), making
them useful for solving ODEs and PDEs, especially in the context of control
theory.


Derivatives of Functions 


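The Laplace transform of the derivative of a function is given by

$$\mathcal{L}\left\{\frac{d}{dt}f(t)\right\} = \int_{0}^{\infty} \frac{df}{dt}\,e^{-st}\,dt \tag{2.60a}$$
$$= \left[f(t)e^{-st}\right]_{t=0}^{t=\infty} - \int_{0}^{\infty} \underbrace{f(t)}_{v}\,\underbrace{(-se^{-st})\,dt}_{du} \tag{2.60b}$$
$$= -f(0) + s\bar{f}(s). \tag{2.60c}$$

The $-f(0)$ term comes from $[e^{-st}f(t)]_0^\infty$, since $e^{-st}f(t)$ vanishes as $t \to \infty$ (for $s$ with
sufficiently large real part) and $e^{-st} = 1$ at $t = 0$.
The formula for the Laplace transform of a higher derivative is given by

$$\mathcal{L}\left\{\frac{d^n}{dt^n}f(t)\right\} = -f^{(n-1)}(0) - sf^{(n-2)}(0) - \cdots - s^{n-1}f(0) + s^n\bar{f}(s), \tag{2.61}$$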


where $f^{(k)}$ denotes the kth derivative. This property is extremely useful, allowing
us to convert PDEs into ODEs and ODEs into algebraic expressions. For
example, consider the linear second-order damped harmonic oscillator,

$$\ddot{x} + a\dot{x} + bx = 0, \tag{2.62}$$

with initial condition $x(0) = x_0$ and $\dot{x}(0) = v_0$. It is possible to Laplace-transform
this equation, noting that $\mathcal{L}(\dot{x}) = -x_0 + s\bar{x}(s)$ and that $\mathcal{L}(\ddot{x}) =
-v_0 + s\mathcal{L}(\dot{x}) = -v_0 - sx_0 + s^2\bar{x}(s)$:

$$s^2\bar{x}(s) - sx_0 - v_0 + as\bar{x}(s) - ax_0 + b\bar{x}(s) = 0. \tag{2.63}$$


Rearranging the terms with $\bar{x}(s)$ on one side and the constants on the other, it
is possible to obtain a rational function for $\bar{x}(s)$:

$$(s^2 + as + b)\,\bar{x}(s) = sx_0 + v_0 + ax_0 \implies \bar{x}(s) = \frac{\overbrace{sx_0 + v_0 + ax_0}^{\text{initial conditions}}}{\underbrace{s^2 + as + b}_{\text{characteristic polynomial}}}. \tag{2.64}$$


It is possible to solve for $x(t)$ by computing the inverse Laplace transform of
$\bar{x}(s)$. For simplicity, let $a = 5$, $b = 4$, $x_0 = 2$, and $v_0 = -5$. Then

$$\bar{x}(s) = \frac{2s + 5}{s^2 + 5s + 4} = \frac{2s + 5}{(s + 1)(s + 4)} = \frac{1}{s + 1} + \frac{1}{s + 4} \implies x(t) = e^{-t} + e^{-4t}.$$

This uses the identity $\mathcal{L}(e^{\lambda t}) = 1/(s - \lambda)$, which is left as an exercise. The denominator
of (2.64) is the characteristic polynomial for the ODE in (2.62), so the
roots determine the eigenvalues $\lambda_1$ and $\lambda_2$. The initial conditions appear exclusively
in the numerator, which determines the amplitude of the $e^{\lambda_1 t}$ and $e^{\lambda_2 t}$
solutions. Exercise 2-11 will determine the solution for general $a$, $b$, $x_0$, and $v_0$.
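The partial-fraction step can be checked numerically. In the sketch below, the poles are the roots of the characteristic polynomial and the amplitudes are the residues of the rational function in (2.64); this assumes the two roots are distinct, and the variable names are illustrative only.

```python
import numpy as np

a, b, x0, v0 = 5.0, 4.0, 2.0, -5.0

# Poles of xbar(s): roots of the characteristic polynomial s^2 + a s + b
lam = np.roots([1.0, a, b])

def numerator(s):
    """Numerator of (2.64), carrying the initial conditions."""
    return s * x0 + v0 + a * x0

# Residues at the two simple poles give the amplitudes of the exponentials
amps = np.array([numerator(lam[0]) / (lam[0] - lam[1]),
                 numerator(lam[1]) / (lam[1] - lam[0])])

def x(t):
    """Inverse Laplace transform: a sum of exponentials at the poles."""
    return np.real(amps[0] * np.exp(lam[0] * t) + amps[1] * np.exp(lam[1] * t))

t = np.linspace(0, 3, 50)
print(lam, amps)
```

For these parameter values, the roots and amplitudes should reproduce the closed-form solution $x(t) = e^{-t} + e^{-4t}$ quoted in the text.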


2.6 Wavelets and Multi-Resolution Analysis 


Wavelets extend the concepts in Fourier analysis to more general
orthogonal bases, and partially overcome the uncertainty principle discussed
above by exploiting a multi-resolution decomposition, as shown in Fig. 2.23(d).
This multi-resolution approach enables different time and frequency fidelities
in different frequency bands, which is particularly useful for decomposing complex
signals that arise from multi-scale processes such as are found in climatology,
neuroscience, epidemiology, finance, and turbulence. Images and audio
signals are also amenable to wavelet analysis, which is currently the leading
method for image compression [23], as will be discussed in subsequent sections
and chapters. Moreover, wavelet transforms may be computed using similar
fast methods [83], making them scalable to high-dimensional data. There are a
number of excellent books on wavelets [707], in addition to the primary
references [192, 474].

The basic idea in wavelet analysis is to start with a function $\psi(t)$, known as
the mother wavelet, and generate a family of scaled and translated versions of
the function:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\left(\frac{t - b}{a}\right). \tag{2.65}$$

The parameters $a$ and $b$ are responsible for scaling and translating the function
$\psi$, respectively. For example, one can imagine choosing $a$ and $b$ to scale and
translate a function to fit in each of the segments in Fig. 2.23(d). If these functions
are orthogonal, then the basis may be used for projection, as in the Fourier
transform.

The simplest and earliest example of a wavelet is the Haar wavelet,
developed in 1910 [307]:

$$\psi(t) = \begin{cases} 1 & \text{for } 0 \le t < 1/2, \\ -1 & \text{for } 1/2 \le t < 1, \\ 0 & \text{otherwise.} \end{cases}$$ (2.66)

The three Haar wavelets, $\psi_{1,0}$, $\psi_{1/2,0}$, and $\psi_{1/2,1/2}$, are shown in Fig. 2.26,
representing the first two layers of the multi-resolution in Fig. 2.23(d). Notice that
by choosing each higher frequency layer as a bisection of the next layer down,
the resulting Haar wavelets are orthogonal, providing a hierarchical basis for a
signal.
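
The orthogonality claim can be checked numerically. The following is a minimal sketch, assuming the normalization in (2.65) and a simple midpoint-rule quadrature on $[0, 1)$; the helper names `haar` and `psi` and the grid size `N` are illustrative choices, not part of the text.

```python
import numpy as np

def haar(t):
    """Haar mother wavelet from (2.66)."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def psi(t, a, b):
    """Scaled and translated wavelet psi_{a,b}(t) from (2.65)."""
    return haar((t - b) / a) / np.sqrt(a)

# Midpoint grid on [0, 1); N is an illustrative resolution
N = 4096
t = (np.arange(N) + 0.5) / N
dt = 1.0 / N

# The three wavelets psi_{1,0}, psi_{1/2,0}, psi_{1/2,1/2}
family = [psi(t, 1.0, 0.0), psi(t, 0.5, 0.0), psi(t, 0.5, 0.5)]

# Gram matrix of inner products, approximated by a Riemann sum
gram = np.array([[np.sum(u * v) * dt for v in family] for u in family])
print(np.round(gram, 6))
```

If the three wavelets are orthonormal, the Gram matrix is the identity: the off-diagonal entries vanish because each finer wavelet integrates to zero over an interval where the coarser one is constant.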
The orthogonality property of wavelets described above is critical for the 
development of the discrete wavelet transform (DWT) below. However, we be- 
gin with the continuous wavelet transform (CWT), which is given by 


$$\mathcal{W}_\psi(f)(a,b) = \langle f, \psi_{a,b} \rangle = \int_{-\infty}^{\infty} f(t)\, \bar{\psi}_{a,b}(t)\, dt,$$ (2.67)


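where $\bar{\psi}_{a,b}$ denotes the complex conjugate of $\psi_{a,b}$. This is only valid for functions
$\psi(t)$ that satisfy the boundedness property that

$$C_\psi = \int_{-\infty}^{\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\, d\omega < \infty.$$ (2.68)

The inverse continuous wavelet transform (iCWT) is given by

$$f(t) = \frac{1}{C_\psi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathcal{W}_\psi(f)(a,b)\, \psi_{a,b}(t)\, \frac{1}{a^2}\, da\, db.$$ (2.69)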


New wavelets may also be generated by the convolution $\psi * \phi$ if $\psi$ is a wavelet
and $\phi$ is a bounded and integrable function. There are many other popular
mother wavelets $\psi$ beyond the Haar wavelet, designed to have various properties.
For example, the Mexican hat wavelet is given by

$$\psi(t) = (1 - t^2)\, e^{-t^2/2},$$ (2.70a)
$$\hat{\psi}(\omega) = \sqrt{2\pi}\, \omega^2 e^{-\omega^2/2}.$$ (2.70b)
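
As a sanity check on (2.70) and (2.68), the sketch below compares a numerically integrated Fourier transform of the Mexican hat wavelet with the closed form, and estimates the constant $C_\psi$. It assumes the transform convention $\hat{\psi}(\omega) = \int \psi(t) e^{-i\omega t}\, dt$; the grids and function names are illustrative choices.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat wavelet, (2.70a)."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def mexican_hat_hat(omega):
    """Closed-form Fourier transform, (2.70b)."""
    return np.sqrt(2.0 * np.pi) * omega**2 * np.exp(-omega**2 / 2.0)

# psi is even, so its Fourier transform reduces to a cosine integral
t = np.linspace(-20.0, 20.0, 40001)
dt = t[1] - t[0]
omegas = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
numeric = np.array([np.sum(mexican_hat(t) * np.cos(w * t)) * dt for w in omegas])
print(np.max(np.abs(numeric - mexican_hat_hat(omegas))))

# Admissibility constant (2.68); the integrand is even in omega,
# so integrate over omega > 0 and double the result
w = np.linspace(1e-6, 20.0, 200001)
dw = w[1] - w[0]
C_psi = 2.0 * np.sum(mexican_hat_hat(w)**2 / w) * dw
print(C_psi, 2.0 * np.pi)
```

Because $\hat{\psi}(\omega)$ vanishes like $\omega^2$ at the origin, the integrand in (2.68) stays bounded there; evaluating the integral by hand gives $C_\psi = 2\pi$ for this wavelet, which the numerical estimate should approach.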
Figure 2.26: Three Haar wavelets for the first two levels of the multi-resolution
in Fig. 2.23(d).


Discrete Wavelet Transform 


As with the Fourier transform and Gabor transform, when computing the wave- 
let transform on data, it is necessary to introduce a discretized version. The 
discrete wavelet transform (DWT) is given by 


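$$\mathcal{W}_\psi(f)(j,k) = \langle f, \psi_{j,k} \rangle = \int_{-\infty}^{\infty} f(t)\, \bar{\psi}_{j,k}(t)\, dt,$$ (2.71)

where $\psi_{j,k}(t)$ is a discrete family of wavelets:

$$\psi_{j,k}(t) = \frac{1}{a^j}\, \psi\left(\frac{t - kb}{a^j}\right).$$ (2.72)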


Again, if this family of wavelets is orthogonal, as in the case of the discrete Haar 
wavelets described above, it is possible to expand a function f(t) uniquely in 
this basis: 


$$f(t) = \sum_{j,k=-\infty}^{\infty} \langle f(t), \psi_{j,k}(t) \rangle\, \psi_{j,k}(t).$$ (2.73)

The explicit computation of a DWT is somewhat involved, and is the subject
of several excellent papers and texts [192, 474, 475, 523, 707]. However, the
goal here is not to provide computational details, but rather to give a high-level
idea of what the wavelet transform accomplishes. By scaling and translating a
given shape across a signal, it is possible to extract multi-scale structures
in an efficient hierarchy that provides an optimal tradeoff between time
and frequency resolution. This general procedure is widely used in audio and
image processing, compression, scientific computing, and machine learning, to
name a few examples.
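
To make the hierarchy concrete, here is a minimal sketch of a multi-level discrete Haar transform written directly with NumPy. It is an illustration under simplifying assumptions (signal length a power of two, orthonormal scaling by $1/\sqrt{2}$), not the algorithm used by any particular wavelet library; all function names are hypothetical.

```python
import numpy as np

def haar_dwt_level(x):
    """One level of an orthonormal Haar DWT on an even-length signal."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # local averages (coarse scale)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # local differences (fine scale)
    return approx, detail

def haar_idwt_level(approx, detail):
    """Invert one level of the Haar DWT."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def haar_dwt(x, levels):
    """Repeatedly split the approximation; details are stored fine to coarse."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt_level(approx)
        details.append(detail)
    return approx, details

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, details = haar_dwt(signal, levels=3)

# Reconstruct by undoing the levels from coarse to fine
recon = approx
for detail in reversed(details):
    recon = haar_idwt_level(recon, detail)

energy_in = np.sum(signal**2)
energy_out = np.sum(approx**2) + sum(np.sum(d**2) for d in details)
print(recon, energy_in, energy_out)
```

Each level halves the number of approximation coefficients, so the coarse scales are described by very few numbers while fine-scale detail is localized in time. Because the transform is orthonormal, the signal energy is preserved across the coefficients.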


2.7 Two-Dimensional Transforms and Image Processing


Although we analyzed both the Fourier transform and the wavelet transform 
on one-dimensional signals, both methods readily generalize to higher spatial 
dimensions, such as two-dimensional and three-dimensional signals. Both the 
Fourier and wavelet transforms have had tremendous impact on image pro- 
cessing and compression, which provides a compelling example to investigate 
higher-dimensional transforms. 


Two-Dimensional Fourier Transform for Images 


The two-dimensional (2D) Fourier transform of a matrix of data $X \in \mathbb{R}^{n \times m}$ is
achieved by first applying the one-dimensional (1D) Fourier transform to every
row of the matrix, and then applying the 1D Fourier transform to every column
of the intermediate matrix. This sequential row-wise and column-wise Fourier
transform is shown in Fig. 2.27. Switching the order of taking the Fourier
transform of rows and columns does not change the result.
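
The row-then-column construction can be verified in a few lines. This is a small sketch on a random test matrix (the matrix size and seed are arbitrary assumptions), comparing both orderings of 1D FFTs against NumPy's built-in 2D FFT.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))  # small test matrix, sizes chosen arbitrarily

# 1D FFT of every row (axis=1), then of every column (axis=0)
rows_then_cols = np.fft.fft(np.fft.fft(X, axis=1), axis=0)

# Same operations in the opposite order
cols_then_rows = np.fft.fft(np.fft.fft(X, axis=0), axis=1)

# Built-in 2D FFT for comparison
direct = np.fft.fft2(X)

print(np.allclose(rows_then_cols, direct), np.allclose(cols_then_rows, direct))
```

The two orderings agree because the row and column transforms act on different indices of the matrix and therefore commute.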

It is simple to compute the 2D FFT in MATLAB

>> fhat = fft2(f);     % 2D FFT
>> f = ifft2(fhat);    % 2D Inverse FFT

and in Python

>>> fhat = np.fft.fft2(f)     # 2D FFT
>>> f = np.fft.ifft2(fhat)    # 2D Inverse FFT


A code to compute the 2D Fourier transform via 1D row-wise and column-wise 
FFTs is provided on the book’s GitHub. 

The two-dimensional FFT is effective for image compression, as many of the 
Fourier coefficients are small and may be neglected without loss in image qual- 
ity. Thus, only a few large Fourier coefficients must be stored and transmitted. 
Code 2.11 and Fig. 2.28 demonstrate the FFT for image compression.
[Panels: FFT all rows; FFT all columns; 2D FFT]

Figure 2.27: Schematic of 2D FFT. First, the FFT is taken of each row, and then
the FFT is taken of each column of the resulting transformed matrix.


Code 2.11: [MATLAB] Image compression via the FFT. 


Bt=fft2(B);    % B is grayscale image from above
Btsort = sort(abs(Bt(:)));  % Sort by magnitude

% Zero out all small coefficients and inverse transform
for keep=[.1 .05 .01 .002];
    thresh = Btsort(floor((1-keep)*length(Btsort)));
    ind = abs(Bt)>thresh;      % Mask of large coefficients to keep
    Atlow = Bt.*ind;           % Zero out small coefficients
    Alow=uint8(ifft2(Atlow));  % Compressed image
    figure, imshow(Alow)       % Plot reconstruction
end



Code 2.11: [Python] Image compression via the FFT. 
Bt = np.fft.fft2(B)
Btsort = np.sort(np.abs(Bt.reshape(-1))) # sort by magnitude

# Zero out all small coefficients and inverse transform
for keep in (0.1, 0.05, 0.01, 0.002):
    thresh = Btsort[int(np.floor((1-keep)*len(Btsort)))]
    ind = np.abs(Bt)>thresh          # Mask of large coefficients to keep
    Atlow = Bt * ind                 # Zero out small coefficients
    Alow = np.fft.ifft2(Atlow).real  # Compressed image
    plt.imshow(Alow,cmap='gray')


Figure 2.28: Compressed image using various thresholds to keep 5%, 1%, and
0.2% of the largest Fourier coefficients.

Finally, the FFT is extensively used for de-noising and filtering signals, as it
is straightforward to isolate and manipulate particular frequency bands. Code 2.12
and Fig. 2.29 demonstrate the use of an FFT threshold filter to de-noise an
image with Gaussian noise added. In this example, it is observed that the noise
is especially pronounced in high-frequency modes, and we therefore zero out
any Fourier coefficient outside of a given radius containing low frequencies.


Code 2.12: [MATLAB] Image de-noising via the FFT. 


Bnoise = B + uint8(200*randn(size(B))); % Add some noise
Bt=fft2(Bnoise);
Btshift = fftshift(Bt);
F = log(abs(Btshift)+1);  % Put FFT on log-scale

[nx,ny] = size(B);
[X,Y] = meshgrid(-ny/2+1:ny/2,-nx/2+1:nx/2);
R2 = X.^2+Y.^2;
ind = R2<150^2;
Btshiftfilt = Btshift.*ind;
Ffilt = log(abs(Btshiftfilt)+1);  % Put FFT on log-scale
Btfilt = ifftshift(Btshiftfilt);
Bfilt = ifft2(Btfilt);

[Panels: Noisy image; Noisy FFT; Filtered image; Filtered FFT]

Figure 2.29: De-noising an image by eliminating high-frequency Fourier
coefficients outside of a given radius (bottom right).

Code 2.12: [Python] Image de-noising via the FFT.

Bnoise = B + 200*np.random.randn(*B.shape).astype('uint8')
Bt = np.fft.fft2(Bnoise)
Btshift = np.fft.fftshift(Bt)
F = np.log(np.abs(Btshift)+1) # Put FFT on log scale

nx,ny = B.shape
X,Y = np.meshgrid(np.arange(-ny/2+1,ny/2+1),np.arange(-nx/2+1,nx/2+1))
R2 = np.power(X,2) + np.power(Y,2)
ind = R2 < 150**2
Btshiftfilt = Btshift * ind
Ffilt = np.log(np.abs(Btshiftfilt)+1) # Put FFT on log scale
Btfilt = np.fft.ifftshift(Btshiftfilt)
Bfilt = np.fft.ifft2(Btfilt).real


Figure 2.30: Illustration of three-level discrete wavelet transform.


Two-Dimensional Wavelet Transform for Images 


Similar to the FFT, the discrete wavelet transform is extensively used for image
processing and compression. Code 2.13 computes the wavelet transform of an
image, and the first three levels are illustrated in Fig. 2.30. In this figure, the
hierarchical nature of the wavelet decomposition is seen. The upper left corner
of the DWT image is a low-resolution version of the image, and the subsequent
features add fine details to the image.


Code 2.13: [MATLAB] Example of a two-level wavelet decomposition. 


%% Wavelet decomposition (2 level)
n = 2; w = 'db1'; [C,S] = wavedec2(B,n,w);

% LEVEL 1
A1 = appcoef2(C,S,w,1); % Approximation
[H1 V1 D1] = detcoef2('a',C,S,1); % Details
A1 = wcodemat(A1,128);
H1 = wcodemat(H1,128);
V1 = wcodemat(V1,128);
D1 = wcodemat(D1,128);

% LEVEL 2
A2 = appcoef2(C,S,w,2); % Approximation
[H2 V2 D2] = detcoef2('a',C,S,2); % Details
A2 = wcodemat(A2,128);
H2 = wcodemat(H2,128);
V2 = wcodemat(V2,128);
D2 = wcodemat(D2,128);

dec2 = [A2 H2; V2 D2];
dec1 = [imresize(dec2,size(H1)) H1 ; V1 D1];
image(dec1);


Code 2.13: [Python] Example of a two-level wavelet decomposition. 


import pywt 


## Wavelet decomposition (2 level)
n = 2
w = 'db1'
coeffs = pywt.wavedec2(B,wavelet=w,level=n)

# normalize each coefficient array
coeffs[0] /= np.abs(coeffs[0]).max()
for detail_level in range(n):
    coeffs[detail_level + 1] = [d/np.abs(d).max() for d in coeffs[detail_level + 1]]

arr, coeff_slices = pywt.coeffs_to_array(coeffs)
plt.imshow(arr,cmap='gray',vmin=-0.25,vmax=0.75)


Figure 2.31 shows several versions of the compressed image for various
compression ratios, as computed by Code 2.14. The hierarchical representation
of data in the wavelet transform is ideal for image compression. Even with an
aggressive truncation, retaining only 0.5% of the DWT coefficients, the coarse
features of the image are retained. Thus, when transmitting data, even if
bandwidth is limited and much of the DWT information is truncated, the most
important features of the data are transferred.

Figure 2.31: Compressed image using various thresholds to keep 5%, 1%, and
0.5% of the largest wavelet coefficients.


Code 2.14: [MATLAB] Wavelet decomposition for image compression. 
[C,S] = wavedec2(B,4,'db1');
Csort = sort(abs(C(:))); % Sort by magnitude

for keep = [.1 .05 .01 .005]
    thresh = Csort(floor((1-keep)*length(Csort)));
    ind = abs(C)>thresh;
    Cfilt = C.*ind; % Threshold small indices

    % Plot Reconstruction
    Arecon=uint8(waverec2(Cfilt,S,'db1'));
    figure, imagesc(uint8(Arecon))
end


Code 2.14: [Python] Wavelet decomposition for image compression. 
n = 4
w = 'db1'
coeffs = pywt.wavedec2(B,wavelet=w,level=n)

coeff_arr, coeff_slices = pywt.coeffs_to_array(coeffs)
Csort = np.sort(np.abs(coeff_arr.reshape(-1)))

for keep in (0.1, 0.05, 0.01, 0.005):
    thresh = Csort[int(np.floor((1-keep)*len(Csort)))]
    ind = np.abs(coeff_arr) > thresh
    Cfilt = coeff_arr * ind # Threshold small indices

    coeffs_filt = pywt.array_to_coeffs(Cfilt,coeff_slices,
                                       output_format='wavedec2')

    # Plot reconstruction
    Arecon = pywt.waverec2(coeffs_filt,wavelet=w)
    plt.imshow(Arecon.astype('uint8'),cmap='gray')
Suggested Reading


Texts 

(1) The analytical theory of heat, by J.-B. J. Fourier, 1978 [249]. 
(2) A wavelet tour of signal processing, by S. Mallat, 1999 [475]. 
(3) Spectral methods in MATLAB, by L. N. Trefethen, 2000 [710]. 
Papers and reviews 


(1) An algorithm for the machine calculation of complex Fourier series, by 
J. W. Cooley and J. W. Tukey, Mathematics of Computation, 1965 [182]. 


(2) The wavelet transform, time-frequency localization and signal analysis, 
by I. Daubechies, IEEE Transactions on Information Theory, 1990 [192]. 


(3) An industrial strength audio search algorithm, by A. Wang et al., ISMIR,
2003 [740].


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




Homework 
Exercise 2-1. Load the image dog.jpg and convert to grayscale. Use the FFT to
compress the image at different compression ratios. Plot the error between the
compressed and actual image as a function of the compression ratio.


Exercise 2-2. Consider the following triangular wave:

$$f(x) = \begin{cases} 0 & \text{for } x < -1, \\ 1 - |x| & \text{for } |x| \le 1, \\ 0 & \text{for } 1 < x. \end{cases}$$

[Plot of the triangular wave $f(x)$ for $x$ between $-2$ and $2$.]

Compute the Fourier series by hand for the domain $-2 \le x < 2$. Plot the mode
coefficients $a_n$ and $b_n$ for the first 100 cosine and sine modes (i.e., for the first
$n = 1$ to $n = 100$). Also, plot the approximation using $n = 10$ modes on top of
the true triangle wave.


In a few sentences, explain the difference between the Fourier transform and 
the Fourier series. 


Exercise 2-3. Use the FFT to solve the Korteweg-de Vries (KdV) equation,

$$u_t + u_{xxx} - u u_x = 0,$$

on a large domain with an initial condition $u(x, 0) = \mathrm{sech}(x)$. Plot the evolution.


Exercise 2-4. Use the FFT to solve the Kuramoto-Sivashinsky (KS) equation,

$$u_t + u_{xx} + u_{xxxx} + \frac{1}{2} u_x^2 = 0,$$

on a large domain with an initial condition $u(x, 0) = \mathrm{sech}(x)$. Plot the evolution.


Exercise 2-5. Solve for the analytic equilibrium temperature distribution using 
the 2D Laplace equation on an L x H sized rectangular domain with the fol- 
lowing boundary conditions. 


(a) Left: $u_x(0, y) = 0$ (insulating).
(b) Bottom: $u(x, 0) = 0$ (zero temperature).
(c) Top: $u(x, H) = f(x)$ (fixed temperature).
(d) Right: $u_x(L, y) = 0$ (insulating).

[Diagram of the rectangular domain: $u(x, H) = f(x)$ on the top edge, $u_x(0, y) = 0$ on the
left edge, $u_x(L, y) = 0$ on the right edge, and $u(x, 0) = 0$ on the bottom edge.]


Solve for a general boundary temperature f(x). Also solve for a particular tem- 
perature distribution f(x); you may choose any non-constant distribution you 
like. 


How would this change if the left and right boundaries were fixed at zero tem- 
perature? (You do not have to solve this new problem, just explain in words 
what would change.) 


Exercise 2-6. Now, compute the solution to the 2D heat equation on a circular
disk through simulation. Recall that the heat equation is given by

$$u_t = \alpha^2 \nabla^2 u.$$

For this problem, we will solve the heat equation using a finite-difference scheme
on a Cartesian grid. We will use a grid of 300 x 300 with the circular disk in the
center. The radius of the circle is $r = 1$, $\alpha = 1$, and the domain is $[-1.5, 1.5]$ in $x$
and $[-1.5, 1.5]$ in $y$. You can impose the boundary conditions by enforcing the
temperature at points that are outside of the disk at the beginning of each new
time-step. It should be easy to find points that are outside the disk, because
they satisfy $x^2 + y^2 > 1$.

Simulate the unsteady heat equation for the following boundary conditions: 


(a) The left half of the boundary of the disk is fixed at a temperature of u = 1 
and the right half of the boundary is fixed at u = 2. Try simulating this 
with zero initial conditions first. Next, try initial conditions inside the disk 
where the top half is u = —1 and the bottom half is u = 1. 


(b) The temperature at the boundary of the disk is fixed at $u(\theta) = \cos(\theta)$.


Include your code and show some plots of your solutions to the heat equation. 
Plot the temperature distribution for each case (1) early on in the diffusion pro- 
cess, (2) near steady state, and (3) somewhere in the middle. 


Exercise 2-7. Consider the PDE for a vibrating string of finite length $L$,

$$u_{tt} = c^2 u_{xx}, \quad 0 \le x \le L,$$

with the initial conditions

$$u(x, 0) = 0, \quad u_t(x, 0) = 0,$$

and boundary conditions

$$u(0, t) = 0, \quad u_x(L, t) = f(t).$$


Solve this PDE by using the Laplace transform. You may keep your solution 
in the frequency domain, since the inverse transform is complicated. Please 
simplify as much as possible using functions like sinh and cosh. 


This PDE cannot be solved by separation of variables. Why not? (That is, try to 
solve with separation of variables until you hit a contradiction.) 


Exercise 2-8 [MATLAB]. Now, we will use the FFT to simultaneously compress 
and re-master an audio file. Please download the file r2112.mat and load the 
audio data into the matrix rush and the sample rate FS. 


(a) Listen to the audio signal (>>sound(rush,FS);). Compute the FFT of
this audio signal. 


(b) Compute the power spectral density vector. Plot this to see what the out- 
put looks like. Also plot the spectrogram. 


(c) Now, download r2112noisy.mat and load this file to initialize the 
variable rushnoisy. This signal is corrupted with high-frequency 
artifacts. Manually zero the last three-quarters of the Fourier compo- 
nents of this noisy signal (if n=length(rushnoisy), then zero- 
out all Fourier coefficients from n/4:n). Use this filtered 
frequency spectrum to reconstruct the clean audio signal. When recon- 
structing, be sure to take the real part of the inverse Fourier 
transform: cleansignal=real(ifft(filteredcoefs));.


Because we are only keeping the first quarter of the frequency data, you 
must multiply the reconstructed signal by 2 so that it has the correct nor- 
malized power. Be sure to use the sound command to listen to the pre- 
and post-filtered versions. Plot the power spectral density and spectro- 
grams of the pre- and post-filtered signals. 


Exercise 2-9. The convolution integral and the impulse response may be used 
to simulate how an audio signal would sound under various conditions, such 
as in a long hallway, in a concert hall, or in a swimming pool. 


The basic idea is that you can record the audio response to an impulsive sound
in a given location, like a concert hall. For example, imagine that you put a
microphone in the most expensive seats in the hall and then record the sound
from a shotgun blast up on the stage. (Do not try this!!) Then, if you have a
“flat” studio recording of some other audio, you can simulate how it would
have sounded in the concert hall by convolving the two signals.

Download and unzip sounds.zip to find various sounds and impulse-response
filters. Convolve the various audio files (labeled sound1.wav, ...) with the
various filters (labeled FilterXYZ.wav, ...). In MATLAB, use the wavread command
to load and the conv command to convolve. It is best to add 10% of the
filtered audio (also known as “wet” audio) to 90% of the original audio (also
known as “dry” audio). Listen to the filtered audio, as well as the original
audio and the impulse-response filters (note that each sound has a sampling rate
of FS=11,025). However, you will need to be careful when adding the 10%
filtered and 90% unfiltered signals, since the filtered audio will not necessarily
have the same length as the original audio.


There is a great video explaining how to actually create these impulse responses: 
http://www.audiocease.com/Pages/Altiverb/sampling.php 
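Exercise 2-10. Verify the following properties of the Laplace transform.

(a) Exponential: $\mathcal{L}(e^{at}) = \frac{1}{s - a}$.
(b) Linearity: $\mathcal{L}(a f(t) + b g(t)) = a \bar{f}(s) + b \bar{g}(s)$ for all constants $a, b \in \mathbb{C}$.
(c) Convolution: $\mathcal{L}(f(t) * g(t)) = \mathcal{L}(f)\, \mathcal{L}(g) = F(s) G(s)$.
(d) Constant: $\mathcal{L}(1) = \frac{1}{s}$.
(e) Delta function: $\mathcal{L}(\delta(t)) = 1$.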


Exercise 2-11. Use the Laplace transform to solve for the general solution to
(2.62) for arbitrary $a$, $b$, $x_0$, and $v_0$.

Show how this solution changes when the system is forced with $u(t)$:

$$\ddot{x} + a\dot{x} + bx = u(t).$$

What if $u(t) = \delta(t)$? What if $u(t) = 1$ for $t > 0$?
Sparsity and Compressed Sensing


The inherent structure observed in natural data implies that the data admits 
a sparse representation in an appropriate coordinate system. In other words, 
if natural data is expressed in a well-chosen basis, only a few parameters are 
required to characterize the modes that are active, and in what proportion. All 
of data compression relies on sparsity, whereby a signal is represented more 
efficiently in terms of the sparse vector of coefficients in a generic transform 
basis, such as Fourier or wavelet bases. Recent fundamental advances in math- 
ematics have turned this paradigm upside down. Instead of collecting a high- 
dimensional measurement and then compressing, it is now possible to acquire 
compressed measurements and solve for the sparsest high-dimensional signal 
that is consistent with the measurements. This so-called compressed sensing is 
a valuable new perspective that is also relevant for complex systems in engi- 
neering, with potential to revolutionize data acquisition and processing. In this 
chapter, we discuss the fundamental principles of sparsity and compression as 
well as the mathematical theory that enables compressed sensing, all worked 
out on motivating examples. 

Our discussion on sparsity and compressed sensing will necessarily involve 
the critically important fields of optimization and statistics. Sparsity is a use- 
ful perspective to promote parsimonious models that avoid overfitting and re- 
main interpretable because they have the minimal number of terms required 
to explain the data. This is related to Occam’s razor, which states that the sim- 
plest explanation is generally the correct one. Sparse optimization is also useful 
for adding robustness with respect to outliers and missing data, which generally
skew the results of least-squares regression, such as the SVD. The topics
in this chapter are closely related to randomized linear algebra discussed in
Section 1.8, and they will also be used in several subsequent chapters. Sparse
regression will be explored further in Chapter 4 and will be used in Section 7.3
to identify interpretable and parsimonious nonlinear dynamical systems models
from data.


3.1 Sparsity and Compression


Most natural signals, such as images and audio, are highly compressible. This
compressibility means that, when the signal is written in an appropriate basis,
only a few modes are active, thus reducing the number of values that must be
stored for an accurate representation. Said another way, a compressible signal
$x \in \mathbb{R}^n$ may be written as a sparse vector $s \in \mathbb{R}^n$ (containing mostly zeros) in a
transform basis $\Psi \in \mathbb{R}^{n \times n}$:

$$x = \Psi s.$$ (3.1)

Specifically, the vector $s$ is called $K$-sparse in $\Psi$ if there are exactly $K$ non-zero
elements. If the basis $\Psi$ is generic, such as the Fourier or wavelet basis, then
only the few active terms in $s$ are required to reconstruct the original signal $x$,
reducing the data required to store or transmit the signal.
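
The relation (3.1) can be illustrated with a small synthetic example. The sketch below assumes an orthonormal discrete cosine basis for $\Psi$ (chosen here only because it is real-valued and easy to build by hand; the text refers to Fourier and wavelet bases), and the sizes and coefficient values are arbitrary.

```python
import numpy as np

n, K = 64, 3

# Orthonormal DCT-II matrix D; its rows are cosine modes (assumed basis choice)
j = np.arange(n)
k = j.reshape(-1, 1)
D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
D[0, :] = np.sqrt(1.0 / n)
Psi = D.T  # columns of Psi are the basis vectors, so x = Psi @ s

# K-sparse coefficient vector: exactly K non-zero entries
s = np.zeros(n)
s[[2, 7, 19]] = [1.5, -2.0, 0.75]

x = Psi @ s          # dense signal in the ambient space
s_rec = Psi.T @ x    # Psi is orthogonal, so its transpose inverts it

print(np.count_nonzero(np.abs(x) > 1e-12), "non-zero samples in x")
print(np.count_nonzero(np.abs(s_rec) > 1e-12), "non-zero coefficients in s")
```

The signal `x` is dense in the sample domain, yet it is fully described by the `K` active coefficients and their indices, which is all that needs to be stored or transmitted when the basis is known to both parties.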

Images and audio signals are both compressible in Fourier or wavelet bases, 
so that after taking the Fourier or wavelet transform, most coefficients are small 
and may be set exactly equal to zero with negligible loss of quality. These few 
active coefficients may be stored and transmitted, instead of the original high- 
dimensional signal. Then, to reconstruct the original signal in the ambient space 
(i.e., in pixel space for an image), one need only take the inverse transform. As 
discussed in Chapter 2, the fast Fourier transform is the enabling technology
that makes it possible to efficiently reconstruct an image x from the sparse co- 
efficients in s. This is the foundation of JPEG compression for images and MP3 
compression for audio. 

The Fourier modes and wavelets are generic or universal bases, in the sense
that nearly all natural images or audio signals are sparse in these bases.
Therefore, once a signal is compressed, one needs only to store or transmit the sparse
vector $s$ rather than the entire matrix $\Psi$, since the Fourier and wavelet
transforms are already hard-coded on most machines. In Chapter 1 we found that it
is also possible to compress signals using the SVD, resulting in a tailored basis.
In fact, there are two ways that the SVD can be used to compress an image:
(1) we may take the SVD of the image directly and only keep the dominant
columns of $U$ and $V$ (Section 1.2); or (2) we may represent the image as a
linear combination of eigen-images, as in the eigenfaces example (Section 1.6). The
first option is relatively inefficient, as the basis vectors $U$ and $V$ must be stored.
However, in the second case, a tailored basis $U$ may be computed and stored
once, and then used to compress an entire class of images, such as human faces.
This tailored basis has the added advantage that the modes are interpretable
as correlation features that may be useful for learning. It is important to note
that both the Fourier basis $\mathcal{F}$ and the SVD basis $U$ are unitary transformations,
which will become important in the following sections.

Although the majority of compression theory has been driven by audio,
image, and video applications, there are many implications for engineering
systems. The solution to a high-dimensional system of differential equations
typically evolves on a low-dimensional manifold, indicating the existence of
coherent structures that facilitate sparse representation. Even broadband
phenomena, such as turbulence, may be instantaneously characterized by a sparse
representation. This has a profound impact on how to sense and compute, as
will be described throughout this chapter and the remainder of the book.


Example: Image Compression 


Compression is relatively simple to implement on images, as described in
Section 2.7 and revisited here (see Fig. 3.1 and Code 3.1).


Code 3.1: [MATLAB] Image compression based on the FFT. 


A=imread('../CODE_DATA/CH03/jelly', 'jpeg'); % Load image
B=rgb2gray(A);    % Convert image to grayscale
imshow(B)         % Plot image

%% Compute the FFT of image using fft2
Bhat=fft2(B);

%% Zero out all small coefficients and inverse transform
Bhatsort = sort(abs(Bhat(:)));
keep = 0.05;
thresh = Bhatsort(floor((1-keep)*length(Bhatsort)));
ind = abs(Bhat)>thresh;
Bhatcompressed = Bhat.*ind;
Bcompressed=uint8(ifft2(Bhatcompressed));


Code 3.1: [Python] Image compression based on the FFT. 
A = imread(os.path.join('..','DATA','jelly.jpg'))
B = np.mean(A, -1); # Convert RGB to grayscale
plt.imshow(B,cmap='gray')

## Compute FFT of image using fft2
Bhat = np.fft.fft2(B)

## Zero out all small coefficients and inverse transform
Bhatsort = np.sort(np.abs(np.reshape(Bhat,-1)))
keep = 0.05
thresh = Bhatsort[int(np.floor((1-keep)*len(Bhatsort)))]
ind = np.abs(Bhat) > thresh
Bhatcompress = Bhat * ind
# Take the real part before casting, since ifft2 returns a complex array
Bcompress = np.fft.ifft2(Bhatcompress).real.astype('uint8')
[Panels: Full image; Fourier coefficients; Truncate (keep 5%); Compressed image]

Figure 3.1: Illustration of compression with the fast Fourier transform (FFT) $\mathcal{F}$.


To understand the role of the sparse Fourier coefficients in a compressed
image, it helps to view the image as a surface, where the height of a point is
given by the brightness of the corresponding pixel. This is shown in Fig. 3.2.
Here we see that the surface is relatively simple, and may be represented as a
sum of a few spatial Fourier modes.

Figure 3.2: Compressed image (left), and viewed as a surface (right).


Why Signals Are Compressible: the Vastness of Image Space 


It is important to note that the compressibility of images is related to the
overwhelming dimensionality of image space. For even a simple 20 x 20 pixel
black-and-white image, there are $2^{400}$ distinct possible images, which is larger than
the number of nucleons in the known universe. The number of images is
considerably more staggering for higher-resolution images with greater color depth.
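
The count is easy to reproduce with exact integer arithmetic. In the sketch below, the figure of $10^{80}$ nucleons is an assumed, commonly quoted order-of-magnitude estimate and is not taken from the text.

```python
# Each of the 20 x 20 pixels is either black or white
n_pixels = 20 * 20
n_images = 2 ** n_pixels           # exact Python integer

digits = len(str(n_images))        # number of decimal digits in 2**400
nucleons = 10 ** 80                # assumption: rough order-of-magnitude estimate

print(digits)                      # 121 digits, i.e. about 2.6e120 images
print(n_images > nucleons)
print(len(str(n_images // nucleons)), "digits in the ratio")
```

Under that assumption, the number of possible images exceeds the nucleon estimate by roughly forty orders of magnitude, even at this tiny resolution and with a single bit per pixel.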

In the space of one megapixel images (i.e., 1000 x 1000 pixels), there is an 
image of us each being born, of me typing this sentence, and of you reading it. 
However vast the space of these natural images, they occupy a tiny, minuscule 
fraction of the total image space. The majority of the images in image space 
represent random noise, resembling television static. For simplicity, consider 
grayscale images, and imagine drawing a random number for the gray value 
of each of the pixels. With exceedingly high probability, the resulting image 
will look like noise, with no apparent significance. You could draw these ran- 
dom images for an entire lifetime and never find an image of a mountain, or a 
person, or anything physically recognizable.¹


¹The vastness of signal space was described in Borges’s “The Library of Babel” in 1944,
where he describes a library containing all possible books that could be written, of which actual
coherent books occupy a nearly immeasurably small fraction [97]. In Borges’s library, there are
millions of copies of this very book, with variations on this single sentence. Another famous
variation on this theme considers that, given enough monkeys typing on enough typewriters,
one would eventually recreate the works of Shakespeare. One of the oldest related descriptions
of these combinatorially large spaces dates back to Aristotle.
Figure 3.3: Illustration of the vastness of image (pixel) space, with natural
images occupying a vanishingly small fraction of the space.


In other words, natural images are extremely rare in the vastness of image
space, as illustrated in Fig. 3.3. Because so many images are unstructured or
random, most of the dimensions used to encode images are only necessary for 
these random images. These dimensions are redundant if all we cared about 
was encoding natural images. An important implication is that the images we 
care about (i.e., natural images) are highly compressible, if we find a suitable 
transformed basis where the redundant dimensions are easily identified. 


3.2 Compressed Sensing 


Despite the considerable success of compression in real-world applications, it
still relies on having access to full high-dimensional measurements. The recent
advent of compressed sensing turns the
compression paradigm upside down: instead of collecting high-dimensional
data just to compress and discard most of the information, it is instead
possible to collect surprisingly few compressed or random measurements and then
infer what the sparse representation is in the transformed basis. The idea
behind compressed sensing is relatively simple to state mathematically, but,
until recently, finding the sparsest vector consistent with measurements was a
nondeterministic polynomial-time (NP) hard problem. The rapid adoption of compressed sensing
throughout the engineering and applied sciences rests on the solid
mathematical framework² that provides conditions for when it is possible to reconstruct
the full signal with high probability using convex algorithms.


²Interestingly, the incredibly important collaboration between Emmanuel Candès and
Terence Tao began with them discussing the odd properties of signal reconstruction at their kids’
daycare.
Mathematically, compressed sensing exploits the sparsity of a signal in a
generic basis to achieve full signal reconstruction from surprisingly few
measurements. If a signal $x$ is $K$-sparse in $\Psi$, then instead of measuring $x$ directly
($n$ measurements) and then compressing, it is possible to collect dramatically
fewer randomly chosen or compressed measurements and then solve for the
non-zero elements of $s$ in the transformed coordinate system. The
measurements $y \in \mathbb{R}^p$, with $K < p \ll n$, are given by


y = Cx. (3.2) 


The measurement matrix³ $C \in \mathbb{R}^{p \times n}$ represents a set of $p$ linear measurements
on the state $x$. The choice of measurement matrix $C$ is of critical importance in
compressed sensing, and is discussed in Section 3.4. Typically, measurements
may consist of random projections of the state, in which case the entries of $C$
are Gaussian or Bernoulli distributed random variables. It is also possible to
measure individual entries of $x$ (i.e., single pixels if $x$ is an image), in which
case $C$ consists of random rows of the identity matrix.

With knowledge of the sparse vector s, it is possible to reconstruct the signal 
x from (3.1). Thus, the goal of compressed sensing is to find the sparsest vector 
s that is consistent with the measurements y: 


$$y = C\Psi s = \Theta s. \tag{3.3}$$


The system of equations in (3.3) is under-determined since there are infinitely 
many consistent solutions s. The sparsest solution ŝ satisfies the following opti- 
mization problem: 


$$\hat{s} = \underset{s}{\operatorname{argmin}} \|s\|_0 \quad \text{subject to} \quad y = C\Psi s, \tag{3.4}$$


where $\|\cdot\|_0$ denotes the $\ell_0$ pseudo-norm, given by the number of non-zero
entries; this is also referred to as the cardinality of s.

The optimization in (3.4) is non-convex, and in general the solution can only
be found with a brute-force search that is combinatorial in n and K. In particular,
all possible K-sparse vectors in $\mathbb{R}^n$ must be checked; if the exact level of
sparsity is unknown, the search is even broader. Because this search is
combinatorial, solving (3.4) is intractable for even moderately large n and K, and
the prospect of solving larger problems does not improve with Moore’s law of
exponentially increasing computational power.
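
To get a feel for this combinatorial growth, the following Python sketch counts the candidate supports that a brute-force search over K-sparse vectors would have to examine. The problem sizes are illustrative assumptions, not values taken from the text.

```python
import math

def num_supports(n: int, K: int) -> int:
    """Number of ways to place K non-zero entries in a length-n vector."""
    return math.comb(n, K)

# Illustrative sizes (assumed for this sketch)
for n, K in [(100, 5), (1000, 10), (4096, 10)]:
    print(f"n = {n:5d}, K = {K:2d}: {num_supports(n, K):.3e} candidate supports")
```

Each candidate support would additionally require solving a small least-squares problem, so the count above is only a lower bound on the work involved.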

Fortunately, under certain conditions on the measurement matrix C, it is
possible to relax the optimization in (3.4) to a convex $\ell_1$-minimization [157, 204]:


$$\hat{s} = \underset{s}{\operatorname{argmin}} \|s\|_1 \quad \text{subject to} \quad y = C\Psi s, \tag{3.5}$$


where $\|\cdot\|_1$ is the $\ell_1$-norm, given by


$$\|s\|_1 = \sum_{k=1}^{n} |s_k|. \tag{3.6}$$


The $\ell_1$-norm is also known as the taxicab or Manhattan norm because it represents
the distance a taxi would take between two points on a rectangular grid.
The overview of compressed sensing is shown schematically in Fig. 3.4. The
$\ell_1$-minimum-norm solution is sparse, while the $\ell_2$-minimum-norm solution is
not, as shown in Fig. 3.5.

Figure 3.4: Schematic of measurements in the compressed sensing framework.

³ In the compressed sensing literature, the measurement matrix is often denoted $\Phi$; instead,
we use C to be consistent with the output equation in control theory. Also, $\Phi$ is already used
to denote DMD modes in Chapter 7.

There are very specific conditions that must be met for the $\ell_1$-minimization
in (3.5) to converge with high probability to the sparsest solution in
(3.4) [151, 156]. These will be discussed in detail in Section 3.4, although they may be
summarized as follows.


(a) The measurement matrix C must be incoherent with respect to the
sparsifying basis $\Psi$, meaning that the rows of C are not correlated with the
columns of $\Psi$.


(b) The number of measurements p must be sufficiently large, on the order of
$$p \approx \mathcal{O}(K \log(n/K)) \approx k_1 K \log(n/K). \tag{3.7}$$
The constant multiplier $k_1$ depends on how incoherent C and $\Psi$ are.


Roughly speaking, these two conditions guarantee that the matrix $C\Psi$ acts as a
unitary transformation on K-sparse vectors s, preserving relative distances
between vectors and enabling almost certain signal reconstruction with $\ell_1$ convex
minimization. This is formulated precisely in terms of the restricted isometry
property (RIP) in Section 3.4.

Figure 3.5: The $\ell_1$- and $\ell_2$-minimum-norm solutions to the compressed sensing
problem: (a) sparse s ($\ell_1$); (b) least-squares s ($\ell_2$). The difference in solutions
for this regression is further considered in Chapter 4.

The idea of compressed sensing may be counter-intuitive at first, especially 
given classical results on sampling requirements for exact signal reconstruction. 
For instance, the Shannon-Nyquist sampling theorem states that perfect
recovery of a signal requires sampling at a rate of at least twice the highest
frequency present. However, this result only provides a strict bound on the
required sampling rate for signals with broadband frequency content. Typically,
the only signals that are truly broadband are those that have already been
compressed. Since an uncompressed signal will generally be sparse in a transform
basis, the Shannon-Nyquist theorem may be relaxed, and the signal may be
reconstructed with considerably fewer measurements than given by the Nyquist
rate. However, even though the number of measurements may be decreased,
compressed sensing does still rely on precise timing of the measurements, as we
will see. Moreover, signal recovery via compressed sensing is not, strictly
speaking, guaranteed, but is instead possible with high probability, making it
foremost a statistical theory. However, the probability of successful recovery
becomes overwhelmingly close to one for moderate-sized problems.


Disclaimer 


A rough schematic of compressed sensing is shown in Fig. 3.6. However, this
schematic is a dramatization, and is not actually based on a compressed sensing
calculation, since using compressed sensing for image reconstruction is
computationally prohibitive. It is important to note that for the majority of
applications in imaging, compressed sensing is not practicable. However, images are
often still used to motivate and explain compressed sensing because of their
ease of manipulation and our intuition for pictures. In fact, we are currently
guilty of this exact misdirection.

Figure 3.6: Schematic illustration of compressed sensing using $\ell_1$-minimization.
Note that this is a dramatization, and is not actually based on a compressed
sensing calculation. Typically, compressed sensing of images requires a significant
number of measurements and is computationally prohibitive.

Upon closer inspection of this image example, we are analyzing an image
with 1024 × 768 pixels, and approximately 5% of the Fourier coefficients
are required for accurate compression. This puts the sparsity level at K =
0.05 × 1024 × 768 ≈ 40000. Thus, a back-of-the-envelope estimate using (3.7),
with a constant multiplier of $k_1 = 3$, indicates that we need p ≈ 350000
measurements, which is about 45% of the original pixels. Even if we had access to
these 45% random measurements, inferring the correct sparse vector of Fourier
coefficients is computationally prohibitive, much more so than the efficient
FFT-based image compression in Section 3.1.
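
The back-of-the-envelope estimate can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the image size, the 5% sparsity level, and the multiplier of 3 come from the text, while the use of the natural logarithm in (3.7) is an assumption made here.

```python
import math

n = 1024 * 768          # number of pixels in the image
K = 0.05 * n            # sparsity level: about 5% of the Fourier coefficients
k1 = 3                  # constant multiplier used in the text

# Estimate from (3.7); the natural logarithm is an assumption of this sketch
p = k1 * K * math.log(n / K)

print(f"K ~ {K:.0f}")
print(f"p ~ {p:.0f}")
print(f"fraction of pixels ~ {p / n:.2f}")
```

Because n/K = 20 regardless of the image size, the required fraction of pixels in this estimate depends only on the assumed sparsity ratio and the multiplier.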

Compressed sensing for images is typically only used in special cases where 
a reduction of the number of measurements is significant. For example, an early 
application of compressed sensing technology was for infant MRI (magnetic 
resonance imaging), where reduction of the time a child must be still could 
reduce the need for dangerous heavy sedation. 

However, it is easy to see that the number of measurements p scales with the 
sparsity level K, so that if the signal is more sparse, then fewer measurements 
are required. The viewpoint of sparsity is still valuable, and the mathematical 
innovation of the convex relaxation of combinatorially hard $\ell_0$ problems to
convex $\ell_1$ problems may be used much more broadly than for compressed sensing
of images.

Alternative Formulations


In addition to the $\ell_1$-minimization in (3.5), there are alternative approaches
based on greedy algorithms that determine the sparse solution of (3.3) through an
iterative matching pursuit problem. For instance, the compressive sampling matching
pursuit (CoSaMP) is computationally efficient, easy to implement, and freely
available.

When the measurements y have additive noise, say white noise of magnitude
$\varepsilon$, there are variants of (3.5) that are more robust:


$$\hat{s} = \underset{s}{\operatorname{argmin}} \|s\|_1 \quad \text{subject to} \quad \|C\Psi s - y\|_2 < \varepsilon. \tag{3.8}$$


A related convex optimization is the following: 


$$\hat{s} = \underset{s}{\operatorname{argmin}} \|C\Psi s - y\|_2 + \lambda \|s\|_1, \tag{3.9}$$


where $\lambda > 0$ is a parameter that weights the importance of sparsity. Equations
(3.8) and (3.9) are closely related [716].
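
As a rough illustration of a penalized formulation like (3.9), the sketch below applies iterative soft thresholding (ISTA) to the closely related objective with a squared residual, $\tfrac{1}{2}\|\Theta s - y\|_2^2 + \lambda\|s\|_1$, where $\Theta$ stands in for $C\Psi$. ISTA is not the solver used in the text; the squared residual, the problem sizes, the noise level, and the value of $\lambda$ are all assumptions of this example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau*||.||_1 (entrywise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(Theta, y, lam, n_iter=2000):
    """Minimize 0.5*||Theta s - y||_2^2 + lam*||s||_1 by proximal gradient steps."""
    step = 1.0 / np.linalg.norm(Theta, 2) ** 2   # 1 / Lipschitz constant of the gradient
    s = np.zeros(Theta.shape[1])
    for _ in range(n_iter):
        grad = Theta.T @ (Theta @ s - y)
        s = soft_threshold(s - step * grad, step * lam)
    return s

rng = np.random.default_rng(0)
n, p, K = 200, 80, 5                                   # assumed problem sizes
Theta = rng.standard_normal((p, n)) / np.sqrt(p)       # stand-in for C @ Psi
s_true = np.zeros(n)
support = rng.choice(n, K, replace=False)
s_true[support] = rng.choice([-1.0, 1.0], K) * (1 + rng.random(K))
y = Theta @ s_true + 0.01 * rng.standard_normal(p)     # noisy measurements

lam = 0.05                                             # assumed sparsity weight
s_hat = ista(Theta, y, lam)
print("largest entries found at:", np.sort(np.argsort(np.abs(s_hat))[-K:]))
print("true support:            ", np.sort(support))
```

Larger values of the weight shrink more coefficients exactly to zero, at the cost of biasing the surviving coefficients toward zero.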


3.3 Compressed Sensing Examples 


This section explores concrete examples of compressed sensing for sparse signal
recovery. The first example shows that the $\ell_1$-norm promotes sparsity when
solving a generic under-determined system of equations, and the second example
considers the recovery of a sparse two-tone audio signal with compressed
sensing.


The $\ell_1$-norm and Sparse Solutions to an Under-determined System


To see the sparsity-promoting effects of the $\ell_1$-norm, we consider a generic
under-determined system of equations. We build a matrix system of equations
$y = \Theta s$ with p = 200 rows (measurements) and n = 1000 columns (unknowns).
In general, there are infinitely many solutions s that are consistent with these
equations, unless we are very unfortunate and the row equations are linearly
dependent while the measurements are inconsistent in these rows. In fact, this
is an excellent example of the probabilistic thinking used more generally in
compressed sensing: if we generate a linear system of equations at random,
that has sufficiently many more unknowns than knowns, then the resulting
equations will have infinitely many solutions with high probability.

Figure 3.7: Comparison of $\ell_1$-minimum-norm (blue, left) and $\ell_2$-minimum-norm
(red, right) solutions to an under-determined linear system.


In MATLAB, it is straightforward to solve this under-determined linear system
for both the minimum $\ell_1$-norm and minimum $\ell_2$-norm solutions. The
minimum $\ell_2$-norm solution is obtained using the pseudo-inverse (related to the SVD
from Chapters 1 and 4). The minimum $\ell_1$-norm solution is obtained via the cvx
(ConVeX) optimization package [293]. Figure 3.7 shows that the $\ell_1$-minimum
solution is in fact sparse (with most entries being nearly zero), while the
$\ell_2$-minimum solution is dense, with a bit of energy in each vector coefficient.


Code 3.2: [MATLAB] Solutions to under-determined linear system $y = \Theta s$.


% Solve y = Theta * s for "s"
n = 1000; % dimension of s
p = 200;  % number of measurements, dim(y)
Theta = randn(p,n);
y = randn(p,1);

% L1 minimum norm solution s_L1
cvx_begin;
    variable s_L1(n);
    minimize( norm(s_L1,1) );
    subject to
        Theta*s_L1 == y;
cvx_end;

s_L2 = pinv(Theta)*y; % L2 minimum norm solution s_L2


Code 3.2: [Python] Solutions to under-determined linear system $y = \Theta s$.


import numpy as np
from scipy.optimize import minimize

# Solve y = Theta * s for "s"
n = 1000  # dimension of s
p = 200   # number of measurements, dim(y)
Theta = np.random.randn(p,n)
y = np.random.randn(p)

# L1 minimum norm solution s_L1
def L1_norm(x):
    return np.linalg.norm(x,ord=1)

# Equality constraint Theta @ x = y
constr = ({'type': 'eq', 'fun': lambda x: Theta @ x - y})
x0 = np.linalg.pinv(Theta) @ y  # initialize with L2 solution
res = minimize(L1_norm, x0, method='SLSQP', constraints=constr)
s_L1 = res.x
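
As an alternative to a general-purpose nonlinear solver, the equality-constrained $\ell_1$ problem can be posed as a linear program. The sketch below is one possible formulation, not the approach used in the listings above: it introduces auxiliary variables u with $|s_k| \le u_k$ and calls `scipy.optimize.linprog`. The smaller problem size and the random seed are assumptions chosen to keep the example quick.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p = 100, 20                       # smaller than the text's example (assumed sizes)
Theta = rng.standard_normal((p, n))
y = rng.standard_normal(p)

# Variables z = [s, u] with |s_k| <= u_k; minimize sum(u) subject to Theta s = y
c = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
A_ub = np.block([[I, -I], [-I, -I]])          #  s - u <= 0  and  -s - u <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([Theta, np.zeros((p, n))])   # equality constraint acts on s only
bounds = [(None, None)] * n + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=bounds, method="highs")
s_L1 = res.x[:n]
s_L2 = np.linalg.pinv(Theta) @ y              # minimum l2-norm solution for comparison

print("non-zeros in s_L1:", np.sum(np.abs(s_L1) > 1e-8))
print("non-zeros in s_L2:", np.sum(np.abs(s_L2) > 1e-8))
```

The linear-programming form makes the convexity of the $\ell_1$ relaxation explicit: both the objective and the constraints are linear in the stacked variable.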


Recovering an Audio Signal from Sparse Measurements 


To illustrate the use of compressed sensing to reconstruct a high-dimensional
signal from a sparse set of random measurements, we consider a two-tone
audio signal:


$$x(t) = \cos(2\pi \times 97 t) + \cos(2\pi \times 777 t). \tag{3.10}$$


This signal is clearly sparse in the frequency domain, as it is defined by a sum 
of exactly two cosine waves. The highest frequency present is 777 Hz, so that 
the Nyquist sampling rate is 1554 Hz. However, leveraging the sparsity of the 
signal in the frequency domain, we can accurately reconstruct the signal with 
random samples that are spaced at an average sampling rate of 128 Hz, which 
is well below the Nyquist sampling rate. Figure 3.8 shows the result of compressed
sensing, as implemented in Code 3.3. In this example, the full signal is
generated from t = 0 to t = 1 with a resolution of n = 4096 and is then randomly
sampled at p = 128 locations in time. The sparse vector of coefficients in
the discrete cosine transform (DCT) basis is solved for using matching pursuit.


Figure 3.8: Compressed sensing reconstruction of a two-tone audio signal given
by $x(t) = \cos(2\pi \times 97 t) + \cos(2\pi \times 777 t)$. The full signal and power spectral
density are shown in panels (a) and (b), respectively. The signal is measured
at random sparse locations in time, demarcated by red points in (a), and these
measurements are used to build the compressed sensing estimate in (c) and
(d). The time series shown in (a) and (c) are a zoom-in of the entire time range,
which is from t = 0 to t = 1.


Code 3.3: [MATLAB] Compressed sensing of two-tone cosine signal.


%% Generate signal, DCT of signal
n = 4096; % points in high resolution signal
t = linspace(0, 1, n);
x = cos(2 * 97 * pi * t) + cos(2 * 777 * pi * t);
xt = fft(x); % Fourier transformed signal
PSD = xt.*conj(xt)/n; % Power spectral density

%% Randomly sample signal
p = 128; % num. random samples, p=n/32
perm = ceil(rand(p, 1) * n); % ceil keeps indices in 1..n (round could return 0)
y = x(perm); % compressed measurement

%% Solve compressed sensing problem
Psi = dct(eye(n, n)); % build Psi
Theta = Psi(perm, :); % Measure rows of Psi

s = cosamp(Theta,y',10,1.e-10,10); % CS via matching pursuit
xrecon = idct(s); % reconstruct full signal

Code 3.3: [Python] Compressed sensing of two-tone cosine signal.


import numpy as np
from scipy.fftpack import dct, idct
from cosamp_fn import cosamp  # CoSaMP helper module distributed with the book's code

## Generate signal, DCT of signal
n = 4096 # points in high resolution signal
t = np.linspace(0,1,n)
x = np.cos(2 * 97 * np.pi * t) + np.cos(2 * 777 * np.pi * t)
xt = np.fft.fft(x) # Fourier transformed signal
PSD = xt * np.conj(xt) / n # Power spectral density

## Randomly sample signal
p = 128 # num. random samples, p = n/32
perm = np.floor(np.random.rand(p) * n).astype(int)
y = x[perm]

## Solve compressed sensing problem
Psi = dct(np.identity(n)) # Build Psi
Theta = Psi[perm,:] # Measure rows of Psi
# CS via matching pursuit
s = cosamp(Theta,y,10,epsilon=1.e-10,max_iter=10)
xrecon = idct(s) # reconstruct full signal


It is important to note that the p = 128 measurements are randomly chosen
from the 4096-sample high-resolution signal. Thus, we know the precise timing of the
sparse measurements at a much higher resolution than our sampling rate. If
we instead chose p = 128 measurements uniformly in time, the compressed sensing
algorithm would fail. Specifically, if we compute the PSD directly from these uniform
measurements, the high-frequency signal will be aliased, resulting in erroneous
frequency peaks.
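
The aliasing described here can be made concrete with a short calculation. The sketch below assumes that the 128 uniform samples over one second correspond to a sampling rate of approximately 128 Hz; the helper `aliased_frequency` is a name introduced only for this illustration.

```python
def aliased_frequency(f, fs):
    """Apparent frequency (Hz) of a tone at f Hz when sampled uniformly at fs Hz."""
    return abs(f - fs * round(f / fs))

fs = 128  # assumed uniform sampling rate in Hz (p = 128 samples over one second)
for f in (97, 777):
    print(f"{f} Hz tone appears at {aliased_frequency(f, fs)} Hz")
```

Both tones fold into the band below 64 Hz, so a spectrum computed from uniform samples would show peaks at frequencies that are not present in the original signal.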

In the compressive sampling matching pursuit (CoSaMP) code, the desired
level of sparsity K must be specified, and this quantity may not be known
ahead of time. The alternative $\ell_1$-minimization routine does not require
knowledge of the desired sparsity level a priori, although convergence to the sparsest
solution relies on having sufficiently many measurements p, which indirectly
depends on K.
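
To show how a greedy method uses the sparsity level K, here is a minimal sketch of orthogonal matching pursuit (OMP), a simpler relative of CoSaMP. It is not the `cosamp` routine used in Code 3.3; the Gaussian matrix, the problem sizes, and the random seed are assumptions, and exact recovery is expected only with high probability for problems of this kind.

```python
import numpy as np

def omp(Theta, y, K):
    """Orthogonal matching pursuit: greedily build a K-element support for y = Theta s."""
    n = Theta.shape[1]
    support = []
    residual = y.copy()
    coef = np.zeros(0)
    for _ in range(K):
        # Pick the column most correlated with the current residual
        correlations = np.abs(Theta.T @ residual)
        if support:
            correlations[support] = 0.0       # do not pick a column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the selected columns, then update the residual
        coef, *_ = np.linalg.lstsq(Theta[:, support], y, rcond=None)
        residual = y - Theta[:, support] @ coef
    s = np.zeros(n)
    s[support] = coef
    return s

rng = np.random.default_rng(0)
n, p, K = 256, 128, 4                                  # assumed problem sizes
Theta = rng.standard_normal((p, n)) / np.sqrt(p)
s_true = np.zeros(n)
idx = rng.choice(n, K, replace=False)
s_true[idx] = rng.choice([-1.0, 1.0], K) * (1 + rng.random(K))
y = Theta @ s_true                                     # noiseless measurements

s_hat = omp(Theta, y, K)
print("recovery error:", np.linalg.norm(s_hat - s_true))
```

If K is set too small, some active coefficients are missed; if it is set too large, the extra selected columns fit noise, which is the practical difficulty noted above.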


3.4 The Geometry of Compression 


Compressed sensing can be summarized in a relatively simple statement: A 
given signal, if it is sufficiently sparse in a known basis, may be recovered 
(with high probability) using significantly fewer measurements than the signal 
length, if there are sufficiently many measurements and these measurements 
are sufficiently random. Each part of this statement can be made precise and 
mathematically rigorous in an overarching framework that describes the geom- 
etry of sparse vectors, and how these vectors are transformed through random 
measurements. Specifically, enough good measurements will result in a matrix 


$$\Theta = C\Psi \tag{3.11}$$


that preserves the distance and inner product structure of sparse vectors s. In
other words, we seek a measurement matrix C so that $\Theta$ acts as a near-isometry
map on sparse vectors. Isometry literally means same distance, and is closely
related to unitarity, which preserves not only distance, but also angles between
vectors. When $\Theta$ acts as a near-isometry, it is possible to solve the following
equation for the sparsest vector s using convex $\ell_1$-minimization:


$$y = \Theta s. \tag{3.12}$$


The remainder of this section describes the conditions on the measurement
matrix C that are required for $\Theta$ to act as a near-isometry map with high
probability. The geometric properties of various norms are shown in Figs. 3.9 and
3.10.

Figure 3.9: Different $\ell_p$-norms establish different shapes for level sets of
constant “radius” or distance from the origin. In the $\ell_2$-norm, these correspond to
circles, whereas in the $\ell_1$ case, these are diamonds. The blue line represents the
solution set of an under-determined system of equations $y = \Theta s$, and the red
curves represent the minimum-norm level sets that intersect this blue line for
different norms. In the $\ell_2$-norm, the minimum-norm solution is not sparse, as
it has components of $s_1$ and $s_2$, while in the $\ell_1$-norm, the solution is sparse, as
$s_1 = 0$.
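
The geometric picture can be checked numerically in two dimensions. The sketch below uses a single equation in two unknowns with assumed coefficients; it computes the minimum $\ell_2$-norm point by projection and the minimum $\ell_1$-norm point by placing all of the weight on the coordinate with the largest coefficient magnitude.

```python
import numpy as np

a = np.array([1.0, 2.0])   # one equation in two unknowns: a[0]*s1 + a[1]*s2 = y (assumed values)
y = 2.0

# Minimum l2-norm solution: orthogonal projection of the origin onto the line
s_l2 = a * y / (a @ a)

# Minimum l1-norm solution: put all of the weight on the coordinate with the largest |a_k|
k = np.argmax(np.abs(a))
s_l1 = np.zeros(2)
s_l1[k] = y / a[k]

print("l2 solution:", s_l2)   # both coordinates active
print("l1 solution:", s_l1)   # only one coordinate active
```

For this line the $\ell_1$ solution lies on a coordinate axis, which is the corner of the diamond-shaped level set touching the solution set.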

Determining how many measurements to take is relatively simple. If the
signal is K-sparse in a basis $\Psi$, meaning that all but K coefficients are zero, then
the number of measurements scales as $p \sim \mathcal{O}(K \log(n/K)) = k_1 K \log(n/K)$,
as in (3.7). The constant multiplier $k_1$, which defines exactly how many
measurements are needed, depends on the quality of the measurements. Roughly
speaking, measurements are good if they are incoherent with respect to the
columns of the sparsifying basis, meaning that the rows of C have small
inner product with the columns of $\Psi$. If the measurements are coherent with
columns of the sparsifying basis, then a measurement will provide little
information unless that basis mode happens to be non-zero in s. In contrast,
incoherent measurements are excited by nearly any active mode, making it possible
to infer the active modes. Delta functions are incoherent with respect to Fourier
modes, as they excite a broadband frequency response. The more incoherent the
measurements, the smaller the required number of measurements p.

Figure 3.10: The minimum-norm point on a line in different $\ell_p$-norms. The blue
line represents the solution set of an under-determined system of equations,
and the red curves represent the minimum-norm level sets that intersect this
blue line for different norms. In the norms between $\ell_0$ and $\ell_1$, the
minimum-norm solution also corresponds to the sparsest solution, with only one
coordinate active. In the $\ell_2$ and higher norms, the minimum-norm solution is not
sparse, but has all coordinates active.

The incoherence of measurements C and the basis $\Psi$ is given by $\mu(C, \Psi)$:


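$$\mu(C, \Psi) = \sqrt{n} \max_{j,k} \left| \langle c_k, \psi_j \rangle \right|, \tag{3.13}$$


where $c_k$ is the kth row of the matrix C and $\psi_j$ is the jth column of the matrix $\Psi$.
The incoherence $\mu$ will range between 1 and $\sqrt{n}$, with smaller values
indicating more incoherent measurements. The formula for incoherence
above only makes sense when the rows of C and columns of $\Psi$ are normalized
to have unit length.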
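
As a quick numerical check of (3.13), the sketch below evaluates $\mu$ for two extreme choices of measurements against a unitary discrete Fourier basis. The sizes and the helper name `coherence` are assumptions of this example.

```python
import numpy as np

def coherence(C, Psi):
    """mu(C, Psi) = sqrt(n) * max |<c_k, psi_j>| for unit-norm rows c_k and columns psi_j."""
    n = Psi.shape[0]
    return np.sqrt(n) * np.max(np.abs(C @ Psi))

n, p = 64, 8                                     # assumed sizes
Psi = np.fft.fft(np.eye(n), norm="ortho")        # unitary discrete Fourier basis (columns)

C_spikes = np.eye(n)[:p, :]                      # measure p individual entries of x
C_fourier = Psi.conj().T[:p, :]                  # measure along p of the basis vectors themselves

print("spikes vs Fourier: ", coherence(C_spikes, Psi))    # close to 1
print("Fourier vs Fourier:", coherence(C_fourier, Psi))   # close to sqrt(n) = 8
```

Single-entry (spike) measurements attain the lower end of the range, which is consistent with the statement that delta functions are incoherent with respect to Fourier modes.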


The Restricted Isometry Property (RIP) 


When measurements are incoherent, the matrix $C\Psi$ satisfies a restricted isometry
property (RIP) for sparse vectors s,


$$(1 - \delta_K)\|s\|_2^2 \le \|C\Psi s\|_2^2 \le (1 + \delta_K)\|s\|_2^2,$$


with restricted isometry constant $\delta_K$ [154]. The constant $\delta_K$ is defined as the
smallest number that satisfies the above inequality for all K-sparse vectors s.
When $\delta_K$ is small, then $C\Psi$ acts as a near-isometry on K-sparse vectors s. In
practice, it is difficult to compute $\delta_K$ directly; moreover, the measurement
matrix C may be chosen to be random, so that it is more desirable to derive
statistical properties about the bounds on $\delta_K$ for a family of measurement matrices
C, rather than to compute $\delta_K$ for a specific C. Generally, increasing the number
of measurements will decrease the constant $\delta_K$, improving the property of $C\Psi$
to act isometrically on sparse vectors. When there are sufficiently many
incoherent measurements, as described above, it is possible to accurately determine
the K non-zero elements of the n-length vector s. In this case, there are bounds
on the constant $\delta_K$ that guarantee exact signal reconstruction for noiseless data.
An in-depth discussion of incoherence and the RIP can be found in [53, 154].

Figure 3.11: Examples of good random measurement matrices C.
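
Although the restricted isometry constant cannot be computed directly for realistic sizes, its effect can be probed by sampling. The sketch below draws random K-sparse vectors and records the ratio $\|\Theta s\|_2^2 / \|s\|_2^2$ for a scaled Gaussian matrix $\Theta$ standing in for $C\Psi$. The sizes, the scaling, and the number of trials are assumptions, and the largest observed deviation is only an empirical lower bound on $\delta_K$, not the constant itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 1000, 200, 10          # assumed sizes
trials = 2000

# Gaussian matrix scaled so that the expected value of ||Theta s||^2 equals ||s||^2
Theta = rng.standard_normal((p, n)) / np.sqrt(p)

ratios = np.empty(trials)
for i in range(trials):
    s = np.zeros(n)
    idx = rng.choice(n, K, replace=False)      # random support of size K
    s[idx] = rng.standard_normal(K)
    ratios[i] = np.sum((Theta @ s) ** 2) / np.sum(s ** 2)

# Largest observed deviation from 1: an empirical lower bound on delta_K
delta_est = np.max(np.abs(ratios - 1.0))
print(f"ratio range: [{ratios.min():.3f}, {ratios.max():.3f}]")
print(f"empirical lower bound on delta_K: {delta_est:.3f}")
```

Repeating the experiment with a larger p should tighten the range of ratios around one, in line with the statement that more measurements decrease the constant.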


Incoherence and Measurement Matrices 


Another significant result of compressed sensing is that there are generic
sampling matrices C that are sufficiently incoherent with respect to nearly all
transform bases. Specifically, Bernoulli and Gaussian random measurement
matrices satisfy the RIP for a generic basis $\Psi$ with high probability [153]. There are
additional results generalizing the RIP and investigating incoherence of sparse
matrices [276].

In many engineering applications, it is advantageous to represent the sig- 
nal x in a generic basis, such as Fourier or wavelets. One key advantage is that 
single-point measurements are incoherent with respect to these bases, exciting 
a broadband frequency response. Sampling at random point locations is ap- 
pealing in applications where individual measurements are expensive, such as 
in ocean monitoring. Examples of random measurement matrices, including
single-pixel, Gaussian, Bernoulli, and sparse random, are shown in Fig. 3.11.

A particularly useful transform basis for compressed sensing is obtained
by the SVD,⁴ resulting in a tailored basis in which the data is optimally sparse
[420]. A truncated SVD basis may result in a more efficient
signal recovery from fewer measurements. Progress has been made developing
a compressed SVD and PCA based on the Johnson-Lindenstrauss (JL) lemma
[250, 277, 351, 571]. The JL lemma is closely related to the RIP, indicating when
it is possible to embed high-dimensional vectors in a low-dimensional space
while preserving spectral properties.

⁴ The SVD provides an optimal low-rank matrix approximation, and it is used in principal
component analysis (PCA) and proper orthogonal decomposition (POD).

Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved.

Figure 3.12: Examples of a bad measurement matrix C.


Bad Measurements


So far we have described how to take good compressed measurements.
Figure 3.12 shows a particularly poor choice of measurements C, corresponding to
the last p columns of the sparsifying basis $\Psi$. In this case, the product $\Theta = C\Psi$
is a $p \times p$ identity matrix padded with zeros on the left, so any signal
s that is not active in the last p columns of $\Psi$ is in the null space of $\Theta$, and is
completely invisible to the measurements y. These measurements therefore
incur significant information loss for many sparse vectors.

Figure 3.13: Least-squares regression is sensitive to outliers (red), while
minimum $\ell_1$-norm regression is robust to outliers (blue).


3.5 Sparse Regression 


The use of the $\ell_1$-norm to promote sparsity significantly pre-dates compressed
sensing. In fact, many benefits of the $\ell_1$-norm were well known and oft used in
statistics decades earlier. In this section, we show that the $\ell_1$-norm may be used
to regularize statistical regression, both to penalize statistical outliers and also
to promote parsimonious statistical models with as few factors as possible. The
role of $\ell_2$ versus $\ell_1$ in regression is further detailed in Chapter 4.


Outlier Rejection and Robustness 


Least-squares regression is perhaps the most common statistical model used 
for data fitting. However, it is well known that the regression fit may be ar- 
bitrarily corrupted by a single large outlier in the data; outliers are weighted 
more heavily in least-squares regression because their distance from the fit line 
is squared. This is shown schematically in Fig. 3.13.

In contrast, $\ell_1$-minimum solutions give equal weight to all data points, making
them potentially more robust to outliers and corrupt data. This procedure is
also known as least absolute deviations (LAD) regression, among other names.
A script demonstrating the use of least-squares ($\ell_2$) and LAD ($\ell_1$) regression for
a data set with an outlier is given in Code 3.4.

Code 3.4: [MATLAB] Use of $\ell_1$-norm for robust statistical regression.

x = sort(4*(rand(25,1)-.5)); % Random data from [-2,2]
b = .9*x + .1*randn(size(x)); % Line y=.9x with noise
atrue = x\b; % Least-squares slope (no outliers)
b(end) = -5.5; % Introduce outlier
acorrupt = x\b; % New slope

cvx_begin; % L1 optimization to reject outlier
    variable aL1; % aL1 is slope to be optimized
    minimize( norm(aL1*x-b,1) ); % aL1 is robust
cvx_end;

Code 3.4: [Python] Use of $\ell_1$-norm for robust statistical regression.

import numpy as np
from scipy.optimize import minimize

x = np.sort(4*(np.random.rand(25,1)-0.5),axis=0) # data
b = 0.9*x + 0.1*np.random.randn(len(x),1) # Noisy line y=ax
atrue = np.linalg.lstsq(x,b,rcond=None)[0] # Least-squares a

b[-1] = -5.5 # Introduce outlier
acorrupt = np.linalg.lstsq(x,b,rcond=None)[0] # New slope

## L1 optimization to reject outlier
def L1_norm(a):
    return np.linalg.norm(a*x-b,ord=1)

a0 = acorrupt.ravel() # initialize to L2 solution (flattened to 1-D for minimize)
res = minimize(L1_norm, a0)
aL1 = res.x[0] # aL1 is robust
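
For a line through the origin, the LAD slope can also be computed without an iterative optimizer, because the objective $\sum_i |a x_i - b_i| = \sum_i |x_i|\,|a - b_i/x_i|$ is minimized by a weighted median of the ratios $b_i/x_i$ with weights $|x_i|$. The sketch below is an alternative to the listings above under that observation; it assumes no $x_i$ is exactly zero, and the function name `lad_slope` and the random seed are introduced only for this illustration.

```python
import numpy as np

def lad_slope(x, b):
    """Slope a minimizing sum_i |a*x_i - b_i|, via a weighted median of the ratios b_i/x_i."""
    x = np.ravel(x)
    b = np.ravel(b)
    ratios = b / x                 # candidate slopes (assumes no x_i is exactly zero)
    weights = np.abs(x)            # |a*x_i - b_i| = |x_i| * |a - b_i/x_i|
    order = np.argsort(ratios)
    cumulative = np.cumsum(weights[order])
    # First sorted ratio at which the cumulative weight reaches half of the total
    k = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return ratios[order][k]

rng = np.random.default_rng(0)
x = np.sort(4 * (rng.random(25) - 0.5))          # data in [-2, 2]
b = 0.9 * x + 0.1 * rng.standard_normal(25)      # noisy line with slope 0.9
b[-1] = -5.5                                     # introduce an outlier

a_l2 = (x @ b) / (x @ x)                         # least-squares slope
a_l1 = lad_slope(x, b)                           # least-absolute-deviations slope
print(f"least-squares slope: {a_l2:.3f}")
print(f"LAD slope:           {a_l1:.3f}")
```

The weighted median ignores how far the outlier lies from the line, which is why the LAD slope stays close to the slope of the uncorrupted data.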


Feature Selection and LASSO Regression 


Interpretability is important in statistical models, as these models are often 
communicated to a non-technical audience, including business leaders and pol- 
icy makers. Generally, a regression model is more interpretable if it has fewer 
terms that bear on the outcome, motivating yet another perspective on sparsity. 

The least absolute shrinkage and selection operator (LASSO) is an $\ell_1$-penalized
regression technique that balances model complexity with descriptive
capability [702]. This principle of parsimony in a model is also a reflection of
Occam’s razor, stating that, among all possible descriptions, the simplest correct
model is probably the true one. Since its inception by Tibshirani in 1996 [702],
the LASSO has become a cornerstone of statistical modeling, with many modern
variants and related techniques [315, 348, 760]. The LASSO is closely related
to the earlier non-negative garrote of Breiman [107], and is also related to earlier
work on soft thresholding by Donoho and Johnstone [207, 208]. LASSO may be
thought of as a sparsity-promoting regression that benefits from the stability of
the $\ell_2$-regularized ridge regression [332], also known as Tikhonov regularization.
The elastic net is a frequently used regression technique that combines the
$\ell_1$ and $\ell_2$ penalty terms from LASSO and ridge regression [777]. Sparse regression
will be explored in more detail in Chapter 4.

Given a number of observations of the predictors and outcomes of a system, 
arranged as rows of a matrix A and a vector b, respectively, regression seeks to 
find the relationship between the columns of A that is most consistent with the 
outcomes in b. Mathematically, this may be written as 


Ax =b. (3.14) 


Least-squares regression will tend to result in a vector x that has non-zero coef- 
ficients for all entries, indicating that all columns of A must be used to predict 
b. However, we often believe that the statistical model should be simpler,
indicating that x may be sparse. The LASSO adds an $\ell_1$ penalty term to regularize
the least-squares regression problem, i.e., to prevent overfitting:


$$x = \underset{x'}{\operatorname{argmin}} \|A x' - b\|_2 + \lambda \|x'\|_1. \tag{3.15}$$


Typically, the parameter $\lambda$ is varied through a range of values and the fit is
validated against a test set of holdout data. If there is not enough data to have a
sufficiently large training set and test set, it is common to repeatedly train and
test the model on random selections of the data (often 80% for training and 20%
for testing), resulting in a cross-validated performance. This cross-validation
procedure enables the selection of a parsimonious model that has relatively few
terms and avoids overfitting.

Many statistical systems are over-determined, as there are more observations
than candidate predictors. Thus, it is not possible to use standard compressed
sensing, as measurement noise will guarantee that no exact sparse solution
exists that minimizes $\|Ax - b\|_2$. However, the LASSO regression works
well with over-determined problems, making it a general regression method.
Note that an early version of the geometric picture in Fig. 3.9 to explain the
sparsity-promoting nature of the $\ell_1$-norm was presented in Tibshirani’s 1996
paper [702].

LASSO regression is frequently used to build statistical models for disease, 
such as cancer and heart failure, since there are many possible predictors, in- 
cluding demographics, lifestyle, biometrics, and genetic information. Thus, LASSO 
represents a clever version of the kitchen-sink approach, whereby nearly all pos- 
sible predictive information is thrown into the mix, and afterwards these are 
then sifted and sieved through for the truly relevant predictors. 

As a simple example, we consider an artificial data set consisting of 100
observations of an outcome, arranged in a vector $b \in \mathbb{R}^{100}$. Each outcome in b
is given by a combination of exactly two out of 10 candidate predictors, whose
observations are arranged in the rows of a matrix $A \in \mathbb{R}^{100 \times 10}$:

A = randn(100,10); % Matrix of possible predictors
x = [0; 0; 1; 0; 0; 0; -1; 0; 0; 0]; % 2 nonzero predictors
b = A*x + 2*randn(100,1); % Observations (with noise)

The vector x is sparse by construction, with only two non-zero entries, and we 
also add noise to the observations in b. The least-squares regression is: 


>>xL2 = pinv(A)*b

xL2 = -0.0232
      -0.3395
       0.9591
      -0.1777
       0.2912
      -0.0525
      -1.2720
      -0.0411
       0.0413
      -0.0500

Note that all coefficients are non-zero. Implementing the LASSO, with 10-fold
cross-validation, is a single straightforward command in MATLAB:


[XL1 FitInfo] = lasso(A,b,'CV',10);


The lasso command sweeps through a range of values for $\lambda$, and the resulting
x are each stored as columns of the matrix XL1. To select the most parsimonious
model that describes the data while avoiding overfitting, we may plot
the cross-validated error as a function of $\lambda$, as in Fig. 3.14:


lassoPlot(XL1,FitInfo,'PlotType','CV')


The green point is at the value of $\lambda$ that minimizes the cross-validated mean-square
error, and the blue point is at the minimum cross-validated error plus
one standard deviation. The resulting model is found via FitInfo.Index1SE:


>>xL1 = XL1(:,FitInfo.Index1SE)

xL1 = 0
      ...
      0

Figure 3.14: Output of lassoPlot command to visualize cross-validated
mean-squared error (MSE) as a function of $\lambda$.


Note that the resulting model is sparse and the correct terms are active.
However, the regression values for these terms are not accurate, and so it may
be necessary to de-bias the LASSO by applying a final least-squares regression
to the non-zero coefficients identified:
>>xL1DeBiased = pinv(A(:,abs(xL1)>0))*b

xL1DeBiased = 1.0980
             -1.0671


In Python, the LASSO and associated analysis functions are also simple: 


import numpy as np
from sklearn import linear_model
from sklearn import model_selection

# LASSO with 10-fold cross-validation over an automatically chosen range of alpha
reg = linear_model.LassoCV(cv=10).fit(A, b)

# Explicit grid search over the regularization weight alpha
lasso = linear_model.Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{'alpha': alphas}]

clf = model_selection.GridSearchCV(lasso, tuned_parameters,
                                   cv=10, refit=False)
clf.fit(A, b)


Related plotting commands to reproduce this example in Python are provided 
on the book’s GitHub. 
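
The de-biasing step can be sketched in Python using only NumPy. The example below is not the book's implementation: it solves the LASSO with a simple iterative soft-thresholding loop (using the common $\tfrac{1}{2m}$ scaling of the squared residual), with a fixed weight `lam` that is an assumed value rather than one chosen by cross-validation, and then refits ordinary least squares on the active columns.

```python
import numpy as np

def lasso_ista(A, b, lam, n_iter=5000):
    """Minimize (1/(2m))*||A x - b||_2^2 + lam*||x||_1 by iterative soft thresholding."""
    m = A.shape[0]
    step = m / np.linalg.norm(A, 2) ** 2          # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - b)) / m
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))                # matrix of possible predictors
x_true = np.array([0, 0, 1, 0, 0, 0, -1, 0, 0, 0], dtype=float)
b = A @ x_true + 2 * rng.standard_normal(100)     # observations (with noise)

xL1 = lasso_ista(A, b, lam=0.4)                   # lam is an assumed value, not cross-validated
active = np.abs(xL1) > 0

# De-bias: ordinary least squares restricted to the active columns
xL1_debiased = np.zeros_like(xL1)
if active.any():
    coef, *_ = np.linalg.lstsq(A[:, active], b, rcond=None)
    xL1_debiased[active] = coef

print("LASSO:     ", np.round(xL1, 3))
print("de-biased: ", np.round(xL1_debiased, 3))
```

The refit keeps the sparsity pattern selected by the LASSO but removes the shrinkage that the $\ell_1$ penalty applies to the surviving coefficients.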


Figure 3.15: Schematic overview of sparse representation for classification.
(Panel labels: test image (person 7), downsample, flatten, person #k.)


3.6 Sparse Representation 


Implicit in our discussion on sparsity is the fact that, when high-dimensional
signals exhibit low-dimensional structure, they admit a sparse representation in
an appropriate basis or dictionary. In addition to a signal being sparse in an
SVD or Fourier basis, it may also be sparse in an overcomplete dictionary whose
columns consist of the training data itself. In essence, in addition to a test
signal being sparse in a generic feature library $U$ from the SVD,
$X = U \Sigma V^*$, it may also have a sparse representation in the dictionary $X$.

Wright et al. demonstrated the power of sparse representation in a dictionary
of test signals for robust classification of human faces, despite significant
noise and occlusions. The so-called sparse representation for classification
(SRC) has been widely used in image processing, and more recently to classify
dynamical regimes in nonlinear differential equations [569].

The basic schematic of SRC is shown in Fig. 3.15, where a library of images
of faces is used to build an overcomplete library $\Theta$. In this example, 30
images are used for each of 20 different people in the Yale B database,
resulting in 600 columns in $\Theta$. To use compressed sensing, i.e.,
$\ell_1$-minimization, we need $\Theta$ to be under-determined, and so we
downsample each image from 192 × 168 to 12 × 10, so that the flattened images
are 120-component vectors. The algorithm used to downsample the images has an
impact on the classification accuracy. A new test image y corresponding to
class c, appropriately downsampled to match the columns of $\Theta$, is then
sparsely represented as a sum of the columns of $\Theta$ using the compressed
sensing algorithm. The resulting vector of coefficients s should be sparse, and
ideally will have large coefficients primarily in the regions of the library
corresponding to the correct person in class c. The final classification stage
in the algorithm is achieved by computing the $\ell_2$ reconstruction error
using the coefficients in the s vector corresponding to each of the categories
separately. The category that minimizes the $\ell_2$ reconstruction error is
chosen for the test image. Figures 3.16–3.19 show the use of SRC to identify
the correct person from images with different noise and corruption.

Figure 3.16: Sparse representation for classification demonstrated using a
library of faces. A clean test image is correctly identified as the person 7 in
the library.

The entire code to reproduce this example in MATLAB and Python is avail- 
able on the book’s GitHub. 
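The SRC pipeline described above (sparse coding in the library, then class-wise $\ell_2$ reconstruction errors) can be sketched on synthetic data as follows. This is not the book's face example: the library is a made-up stand-in with the same 120 × 600 shape, the sparse code is computed with a simple ISTA loop for an $\ell_1$-regularized least-squares problem rather than a dedicated compressed sensing solver, and the function name sparse_code and all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n_per_class, dim = 20, 30, 120

# Synthetic stand-in for the downsampled library (assumption): each class is a
# random prototype vector plus small within-class variation.
prototypes = rng.standard_normal((n_classes, dim))
Theta = np.hstack([
    prototypes[c][:, None] + 0.3 * rng.standard_normal((dim, n_per_class))
    for c in range(n_classes)
])
Theta /= np.linalg.norm(Theta, axis=0)            # unit-norm columns
labels = np.repeat(np.arange(n_classes), n_per_class)

# New test signal drawn from class 7 (it is not a column of the library)
true_class = 7
y = prototypes[true_class] + 0.3 * rng.standard_normal(dim)
y /= np.linalg.norm(y)

def sparse_code(Theta, y, lam=0.05, n_iter=500):
    """Hypothetical helper: ISTA for min_s 0.5*||Theta s - y||_2^2 + lam*||s||_1."""
    step = 1.0 / np.linalg.norm(Theta, 2) ** 2
    s = np.zeros(Theta.shape[1])
    for _ in range(n_iter):
        z = s - step * (Theta.T @ (Theta @ s - y))
        s = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return s

s = sparse_code(Theta, y)

# Reconstruction error using only the coefficients that belong to each class
errors = np.array([
    np.linalg.norm(y - Theta[:, labels == c] @ s[labels == c])
    for c in range(n_classes)
])
predicted = int(np.argmin(errors))
print("predicted class:", predicted)
```

The classification decision uses only the per-class residuals, so the sparse code does not need to be exact; it only needs to concentrate its large coefficients on the correct block of columns.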
Figure 3.17: Sparse representation for classification demonstrated on example
face from person 7 occluded by a fake mustache.

Figure 3.18: Sparse representation for classification demonstrated on example
image with 30% occluded pixels (randomly chosen and uniformly distributed).

Figure 3.19: Sparse representation for classification demonstrated on example
with white noise added to image.


3.7 Robust Principal Component Analysis (RPCA) 


As mentioned earlier in Section 3.5, least-squares regression models are highly
susceptible to outliers and corrupted data. Principal component analysis (PCA)
suffers from the same weakness, making it fragile with respect to outliers. To
ameliorate this sensitivity, Candès et al. have developed a robust principal
component analysis (RPCA) that seeks to decompose a data matrix X into a
structured low-rank matrix L and a sparse matrix S containing outliers and
corrupt data:

$$X = L + S. \tag{3.16}$$


The principal components of L are robust to the outliers and corrupt data in 
S. This decomposition has profound implications for many modern problems 
of interest, including video surveillance (where the background objects appear 
in L and foreground objects appear in S), face recognition (eigenfaces are in L 
and shadows, occlusions, etc. are in S), natural language processing and latent 
semantic indexing, and ranking problems.*


*The ranking problem may be thought of in terms of the Netflix prize for matrix completion.
In the Netflix prize, a large matrix of preferences is constructed, with rows corresponding to 
users and columns corresponding to movies. This matrix is sparse, as most users only rate a 
handful of movies. The Netflix prize seeks to accurately fill in the missing entries of the matrix, 
revealing the likely user rating for movies the user has not seen. 
Mathematically, the goal is to find L and S that satisfy the following:

$$\min_{L,S} \; \operatorname{rank}(L) + \|S\|_0 \quad \text{subject to} \quad L + S = X. \tag{3.17}$$

However, neither the $\operatorname{rank}(L)$ nor the $\|S\|_0$ terms are
convex, and this is not a tractable optimization problem. Similar to the
compressed sensing problem, it is possible to solve for the optimal L and S
with high probability using a convex relaxation of (3.17):

$$\min_{L,S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = X. \tag{3.18}$$

Here, $\|\cdot\|_*$ denotes the nuclear norm, given by the sum of singular
values, which is a proxy for rank. Remarkably, the solution to (3.18) converges
to the solution of (3.17) with high probability if
$\lambda = 1/\sqrt{\max(n,m)}$, where n and m are the dimensions of X, given
that L and S satisfy the following conditions:

(a) L is not sparse; and

(b) S is not low-rank; we assume that the entries are randomly distributed so
that they do not have low-dimensional column space.


The convex problem in (3.18) is known as principal component pursuit (PCP),
and may be solved using the augmented Lagrange multiplier (ALM) algorithm.
Specifically, an augmented Lagrangian may be constructed:

$$\mathcal{L}(L,S,Y) = \|L\|_* + \lambda \|S\|_1 + \langle Y, X - L - S \rangle + \frac{\mu}{2} \|X - L - S\|_F^2. \tag{3.19}$$

A general solution would solve for the $L_k$ and $S_k$ that minimize
$\mathcal{L}$, update the Lagrange multipliers
$Y_{k+1} = Y_k + \mu (X - L_k - S_k)$, and iterate until the solution
converges. However, for this specific system, the alternating directions method
(ADM) provides a simple procedure to find L and S.

First, a shrinkage operator
$\mathcal{S}_\tau(x) = \operatorname{sign}(x) \max(|x| - \tau, 0)$ is
constructed. In MATLAB, the function shrink is defined as
function out = shrink(X,tau)
    out = sign(X).*max(abs(X)-tau,0);
end


In Python, the function shrink is defined as 


def shrink(X,tau):
    Y = np.abs(X)-tau
    return np.sign(X) * np.maximum(Y,np.zeros_like(Y))


Next, the singular value threshold operator
$\mathrm{SVT}_\tau(X) = U \mathcal{S}_\tau(\Sigma) V^*$ is constructed. In
MATLAB, the function SVT is defined as

function out = SVT(X,tau)
    [U,S,V] = svd(X,'econ');
    out = U*shrink(S,tau)*V';
end


In Python, the function SVT is defined as 


def SVT(X,tau):
    U,S,VT = np.linalg.svd(X,full_matrices=0)
    out = U @ np.diag(shrink(S,tau)) @ VT
    return out
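As a quick sanity check on the two operators, the following sketch applies them to a small random matrix and compares singular values before and after thresholding. The matrix size and the threshold value are arbitrary assumptions; the function bodies restate the definitions above so that the block runs on its own.

```python
import numpy as np

def shrink(X, tau):
    # Soft threshold: move every entry toward zero by tau, clipping at zero
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def SVT(X, tau):
    # Soft-threshold the singular values and keep the singular vectors
    U, S, VT = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(S, tau)) @ VT

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))
tau = 2.0

sigma_before = np.linalg.svd(X, compute_uv=False)
sigma_after = np.linalg.svd(SVT(X, tau), compute_uv=False)

print("singular values before:", np.round(sigma_before, 3))
print("singular values after: ", np.round(sigma_after, 3))
print("rank after thresholding:", int(np.sum(sigma_after > 1e-10)))
```

Every singular value is reduced by tau and those below tau are set to zero, which is why repeated application of SVT drives the L iterate toward low rank.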


Finally, it is possible to use the $\mathcal{S}_\tau$ and SVT operators
iteratively to solve for L and S as in Code 3.5.


Code 3.5: [MATLAB] RPCA using alternating directions method (ADM). 


function [L,S] = RPCA(X)
[n1,n2] = size(X);
mu = n1*n2/(4*sum(abs(X(:))));
lambda = 1/sqrt(max(n1,n2));
thresh = 1e-7*norm(X,'fro');

L = zeros(size(X));
S = zeros(size(X));
Y = zeros(size(X));
count = 0;
while((norm(X-L-S,'fro')>thresh)&&(count<1000))
    L = SVT(X-S+(1/mu)*Y,1/mu);
    S = shrink(X-L+(1/mu)*Y,lambda/mu);
    Y = Y + mu*(X-L-S);
    count = count + 1
end


Code 3.5: [Python] RPCA using alternating directions method (ADM). 


def RPCA(X):
    n1,n2 = X.shape
    mu = n1*n2/(4*np.sum(np.abs(X.reshape(-1))))
    lambd = 1/np.sqrt(np.maximum(n1,n2))
    thresh = 10**(-7) * np.linalg.norm(X)

    S = np.zeros_like(X)
    Y = np.zeros_like(X)
    L = np.zeros_like(X)

    count = 0
    while (np.linalg.norm(X-L-S) > thresh) and (count < 1000):
        L = SVT(X-S+(1/mu)*Y,1/mu)
        S = shrink(X-L+(1/mu)*Y,lambd/mu)
        Y = Y + mu*(X-L-S)
        count += 1
    return L,S


RPCA is demonstrated on the eigenfaces data with the following code in MATLAB

load allFaces.mat
X = faces(:,1:nfaces(1));
[L,S] = RPCA(X);

and in Python

X = faces[:,:nfaces[0]]
L,S = RPCA(X)


In this example, the original columns of X, along with the low-rank and
sparse components, are shown in Fig. 3.20. Notice that in this example, RPCA
effectively fills in occluded regions of the image, corresponding to shadows. In 
the low-rank component L, shadows are removed and filled in with the most 
consistent low-rank features from the eigenfaces. This technique can also be 
used to remove other occlusions such as fake mustaches, sunglasses, or noise. 
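Because the faces data may not be at hand, the same algorithm can be exercised on a synthetic matrix with a known low-rank part and known sparse corruption. The sketch below restates the shrink, SVT, and RPCA definitions so that it is self-contained; the matrix size, the rank, the 5% corruption rate, and the corruption magnitude are assumptions chosen for illustration.

```python
import numpy as np

def shrink(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def SVT(X, tau):
    U, S, VT = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(S, tau)) @ VT

def RPCA(X, max_iter=1000):
    n1, n2 = X.shape
    mu = n1 * n2 / (4 * np.sum(np.abs(X)))
    lambd = 1 / np.sqrt(max(n1, n2))
    thresh = 1e-7 * np.linalg.norm(X)
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)
    count = 0
    while np.linalg.norm(X - L - S) > thresh and count < max_iter:
        L = SVT(X - S + Y / mu, 1 / mu)          # low-rank update
        S = shrink(X - L + Y / mu, lambd / mu)   # sparse update
        Y = Y + mu * (X - L - S)                 # multiplier update
        count += 1
    return L, S

rng = np.random.default_rng(0)
n, rank = 80, 2

# Ground truth (assumed): rank-2 matrix plus 5% large-magnitude outliers
L0 = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = 10.0 * rng.choice([-1.0, 1.0], size=mask.sum())
X = L0 + S0

L, S = RPCA(X)
rel_err = np.linalg.norm(L - L0) / np.linalg.norm(L0)
print("relative error in recovered low-rank part:", rel_err)
```

Note that np.zeros_like inherits the dtype of X, so X should be a floating-point array; an integer image array would need to be converted first.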


3.8 Sparse Sensor Placement 


Until now, we have investigated signal reconstruction in a generic basis, such 
as Fourier or wavelets, with random measurements. This provides consider- 
able flexibility, as no prior structure is assumed, except that the signal is sparse 
in a known basis. For example, compressed sensing works equally well for re- 
constructing an image of a mountain, a face, or a cup of coffee. However, if we 
know that we will be reconstructing a human face, we can dramatically reduce 
the number of sensors required for reconstruction or classification by
optimizing sensors for a particular feature library Ψ_r = U built from the SVD.

Thus, it is possible to design tailored sensors for a particular library, in
contrast to the previous approach of random sensors in a generic library.
Near-optimal sensor locations may be obtained using fast greedy procedures that
scale well with large signal dimension, such as the matrix QR factorization.
The following discussion will closely follow Manohar et al. [481] and
B. Brunton et al. [122], and the reader is encouraged to find more details
there. Similar approaches will be used for efficient sampling of reduced-order
models in Chapter 13, where they are termed hyper-reduction. There are also
extensions of the following for sensor and actuator placement in control [484],
based on the balancing transformations discussed in Chapter 9.

Figure 3.20: Output of RPCA for images in the Yale B database.


Optimizing sensor locations is important for nearly all downstream tasks, 
including classification, prediction, estimation, modeling, and control. How- 
ever, identifying optimal locations involves a brute-force search through the 
combinatorial choices of p sensors out of n possible locations in space. Recent 
greedy and sparse methods are making this search tractable and scalable to 
large problems. Reducing the number of sensors through principled selection 
may be critically enabling when sensors are costly, and may also enable faster 
state estimation for low-latency, high-bandwidth control. 

The examples in this section are in MATLAB. However, extensive Python 
code for sparse sensing is available at the following: 


Sparse Sensor Placement for Reconstruction 


The goal of optimized sensor placement in a tailored library
$\Psi_r \in \mathbb{R}^{n \times r}$ is to design a sparse measurement matrix
$C \in \mathbb{R}^{p \times n}$, so that inversion of the linear system of
equations

$$y = C \Psi_r a = \Theta a \tag{3.20}$$

is as well conditioned as possible. In other words, we will design C to minimize
the condition number of $C \Psi_r = \Theta$, so that it may be inverted to
identify the low-rank coefficients a given noisy measurements y. The condition
number of a matrix $\Theta$ is the ratio of its maximum and minimum singular
values, indicating how sensitive matrix multiplication or inversion is to errors
in the input. Larger condition numbers indicate worse performance inverting a
noisy signal. The condition number is a measure of the worst-case error when the
signal a is in the singular vector direction associated with the minimum
singular value of $\Theta$, and noise is added that is aligned with the maximum
singular vector:

$$\Theta (a + \epsilon_a) = \sigma_{\min} a + \sigma_{\max} \epsilon_a. \tag{3.21}$$

Thus, the signal-to-noise ratio decreases by the condition number after mapping
through $\Theta$. We therefore seek to minimize the condition number through
a principled choice of C. This is shown schematically in Fig. 3.21 for p = r.

Figure 3.21: Least-squares with r sparse sensors provides a unique solution to
a, hence x. Reproduced with permission from Manohar et al. [481].
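The worst-case statement in (3.21) can be checked numerically on a small example. The sketch below builds a 2 × 2 matrix with prescribed singular values 10 and 0.1 (an assumed toy case, not a sensor matrix from the text), places the signal along the weakest singular direction and the noise along the strongest, and compares the signal-to-noise ratio before and after the mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy matrix with singular values 10 and 0.1 (condition number 100)
U, _ = np.linalg.qr(rng.standard_normal((2, 2)))
V, _ = np.linalg.qr(rng.standard_normal((2, 2)))
Theta = U @ np.diag([10.0, 0.1]) @ V.T

sigma = np.linalg.svd(Theta, compute_uv=False)
kappa = sigma[0] / sigma[-1]

# Worst case: signal along the weakest right singular vector,
# noise along the strongest one
a = V[:, 1]
eps = 1e-3 * V[:, 0]

snr_in = np.linalg.norm(a) / np.linalg.norm(eps)
snr_out = np.linalg.norm(Theta @ a) / np.linalg.norm(Theta @ eps)

print("condition number:", kappa)
print("SNR before and after mapping:", snr_in, snr_out)
print("SNR degradation factor:", snr_in / snr_out)
```

The degradation factor equals the condition number, which is the sense in which a poorly conditioned $\Theta$ makes noisy measurements hard to invert.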
When the number of sensors is equal to the rank of the library, i.e., p = r,
then $\Theta$ is a square matrix, and we are choosing C to make this matrix as
well conditioned for inversion as possible. When p > r, we seek to improve the
condition of $M = \Theta^T \Theta$, which is involved in the pseudo-inverse. It
is possible to develop optimization criteria that optimize the minimum singular
value, the trace, or the determinant of $\Theta$ (respectively M). However, each
of these optimization problems is NP-hard, requiring a combinatorial search over
the possible sensor configurations. Iterative methods exist to solve this
problem, such as convex optimization and semi-definite programming [353],
although these methods may be expensive, requiring iterative n × n matrix
factorizations. Instead, greedy algorithms are generally used to approximately
optimize the sensor placement. These gappy POD methods originally relied on
random subsampling. However, significant performance advances were demonstrated
by using principled sampling strategies for reduced-order models (ROMs)
in fluid dynamics and ocean modeling [767]. More recently, variants of the
so-called empirical interpolation method (EIM, DEIM and Q-DEIM) [215]
have provided near-optimal sampling for interpolative reconstruction of
nonlinear terms in ROMs.


Random Sensors. In general, randomly placed sensors may be used to estimate
mode coefficients a. However, when p = r and the number of sensors is
equal to the number of modes, the condition number is typically very large.
In fact, the matrix $\Theta$ is often numerically singular and the condition
number is near $10^{16}$. Oversampling, as in Section rapidly improves the
condition number, and even p = r + 10 usually has much better reconstruction
performance.


QR Pivoting for Sparse Sensors. The greedy matrix QR factorization with
column pivoting of $\Psi_r^T$, explored by Drmac and Gugercin for reduced-order
modeling, provides a particularly simple and effective sensor optimization.
The QR pivoting method is fast, simple to implement, and provides nearly
optimal sensors tailored to a specific SVD/POD basis. QR factorization is
optimized for most scientific computing libraries, including MATLAB, LAPACK,
and NumPy. In addition, QR can be sped up by ending the procedure after the
first p pivots are obtained.

The reduced matrix QR factorization with column pivoting decomposes a
matrix $A \in \mathbb{R}^{m \times n}$ into a unitary matrix Q, an upper
triangular matrix R, and a column permutation matrix $C^T$ such that
$A C^T = QR$. The pivoting procedure provides an approximate greedy solution
method to maximize the matrix volume, which is the absolute value of the
determinant. QR column pivoting increments the volume of the submatrix
constructed from the pivoted columns by selecting a new pivot column with
maximal 2-norm, then subtracting from every other column its orthogonal
projection onto the pivot column.

Thus QR factorization with column pivoting yields r point sensors (pivots)
that best sample the r basis modes $\Psi_r$:

$$\Psi_r^T C^T = QR. \tag{3.22}$$


Sensor selection based on the pivoted QR algorithm is quite simple for p = r.
In MATLAB, the code is

[Q,R,pivot] = qr(Psi_r','vector'); % QR sensor selection
C = zeros(p,n);
for j=1:p
    C(j,pivot(j))=1;
end

In Python the code is

from scipy import linalg

Q,R,pivot = linalg.qr(Psi_r.T,pivoting=True)
C = np.zeros_like(Psi_r.T)
for k in range(r):
    C[k,pivot[k]] = 1

Figure 3.22: (left) Original image and locations of p = 100 QR sensors in an
r = 100 mode library. (middle) Reconstruction with QR sensors. (right)
Reconstruction with random sensors.


It may also be advantageous to use oversampling [557], choosing more sensors 
than modes, so p > r. In this case, there are several strategies, and random 
oversampling is a robust choice. 
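The p = r case can be tried end to end on synthetic data. In the sketch below, the snapshots are random mixtures of low-frequency cosines (an assumption standing in for a real data set), the tailored basis comes from their SVD, and the QR pivots are compared with randomly chosen sensors through the condition number of the resulting $\Theta$. The only library call beyond NumPy is scipy.linalg.qr with pivoting=True.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
n, m, r = 200, 60, 10

# Synthetic smooth snapshots (assumed): random mixtures of 15 cosine modes
grid = np.linspace(0, 1, n)
modes = np.cos(np.pi * np.outer(grid, np.arange(1, 16)))   # n x 15
Xdata = modes @ rng.standard_normal((15, m))                # n x m

# Tailored basis: first r left singular vectors
U, _, _ = np.linalg.svd(Xdata, full_matrices=False)
Psi_r = U[:, :r]

# QR with column pivoting on Psi_r^T; the first r pivots are sensor locations
_, _, pivot = linalg.qr(Psi_r.T, pivoting=True)
sensors = pivot[:r]
C = np.zeros((r, n))
C[np.arange(r), sensors] = 1

# Random sensors for comparison
random_sensors = rng.choice(n, size=r, replace=False)

cond_qr = np.linalg.cond(Psi_r[sensors, :])
cond_rand = np.linalg.cond(Psi_r[random_sensors, :])
print("condition number, QR sensors:    ", cond_qr)
print("condition number, random sensors:", cond_rand)

# Recover a signal in the span of Psi_r from its r point measurements
x = Psi_r @ rng.standard_normal(r)
y = C @ x
a_hat = np.linalg.solve(C @ Psi_r, y)
x_hat = Psi_r @ a_hat
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

The exact recovery here relies on the test signal lying in the span of the basis and on noise-free measurements; with noise, the reconstruction error scales with the condition number printed above.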


Example: Reconstructing a Face with Sparse Sensors 


To demonstrate the concept of signal reconstruction in a tailored basis, we will
design optimized sparse sensors in the library of eigenfaces from Section.
Figure 3.22 shows the QR sensor placement and reconstruction, along with the
reconstruction using random sensors. We use p = 100 sensors in an r = 100
mode library. This code assumes that the faces have been loaded and the
singular vectors are in a matrix U. Optimized QR sensors result in a more
accurate reconstruction, with about three times less reconstruction error. In
addition, the condition number is orders of magnitude smaller than with random
sensors. Both QR and random sensors may be improved by oversampling. From the
QR sensors C based on $\Psi_r$, it is possible to reconstruct an approximate
image from these sensors. In MATLAB, the reconstruction is given by


Theta = C*Psi_r;
y = faces(pivot(1:p),1);   % Measure at pivot locations
a = Theta\y;               % Estimate coefficients
faceRecon = Psi_r * a;     % Reconstruct face


Figure 3.23: Schematic illustrating SVD for feature extraction, followed by
linear discriminant analysis (LDA) for the automatic classification of data into
two categories A and B. Reproduced with permission from Bai et al. [40].


In Python, the reconstruction is given by 


Theta = np.dot(C,Psi_r)
y = faces[pivot[:p],0]                 # Measure at pivot locations
a = np.dot(np.linalg.pinv(Theta),y)    # Estimate coefficients
faceRecon = np.dot(Psi_r,a)            # Reconstruct face


Sparse Classification 


For image classification, even fewer sensors may be required than for
reconstruction. For example, sparse sensors may be selected that contain the
most discriminating information to characterize two categories of data [122].
Given a library of r SVD modes $\Psi_r$, it is often possible to identify a
vector $w \in \mathbb{R}^r$ in this subspace that maximally distinguishes
between two categories of data, as described in Section 5.6 and shown in
Fig. 3.23. Sparse sensors s that map into this discriminating direction,
projecting out all other information, are found by

$$s = \operatorname*{argmin}_{s'} \|s'\|_1 \quad \text{subject to} \quad \Psi_r^T s' = w. \tag{3.23}$$


This sparse sensor placement optimization for classification (SSPOC) is shown
in Fig. 3.24 for an example classifying dogs versus cats. The library $\Psi_r$
contains the first r eigen-pets and the vector w identifies the key differences
between dogs and cats. Note that this vector does not care about the degrees of
freedom that characterize the various features within the dog or cat clusters,
but rather only the differences between the two categories. Optimized sensors
are aligned with regions of interest, such as the eyes, nose, mouth, and ears.
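The $\ell_1$ problem in (3.23) is a linear program once s is split into positive and negative parts, s = u - v with u, v ≥ 0. The sketch below solves it with scipy.optimize.linprog on made-up inputs: the orthonormal basis Psi_r and the direction w are random stand-ins (assumptions), not the eigen-pets library or an LDA direction, and the dual simplex variant of the HiGHS solver is requested so that the solver returns a vertex solution.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, r = 100, 5

# Assumed stand-ins: an orthonormal n x r basis and a discriminating direction
Psi_r, _ = np.linalg.qr(rng.standard_normal((n, r)))
w = rng.standard_normal(r)

# min sum(u) + sum(v)  subject to  Psi_r^T (u - v) = w,  u >= 0, v >= 0
cost = np.ones(2 * n)
A_eq = np.hstack([Psi_r.T, -Psi_r.T])
res = linprog(cost, A_eq=A_eq, b_eq=w, bounds=(0, None), method="highs-ds")

s = res.x[:n] - res.x[n:]
sensor_locations = np.flatnonzero(np.abs(s) > 1e-9)

print("number of selected sensors:", sensor_locations.size)
print("sensor locations:", sensor_locations)
print("constraint residual:", np.linalg.norm(Psi_r.T @ s - w))
```

A vertex of this linear program has at most r nonzero variables, so the number of selected sensors does not exceed the number of modes, which is consistent with needing very few sensors for a two-category decision.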
Figure 3.24: Sparse sensor placement optimization for classification (SSPOC)
illustrated for optimizing sensors to classify dogs and cats. Reproduced with
permission from B. Brunton et al. [122].
Homework


Exercise 3-1. Load the image dog.jpg and convert to grayscale. We will repeat 
Exercise 2-1, using the FFT to compress the image at different compression ra- 
tios. However, now, we will compare the error versus compression ratio for 
the image downsampled at different resolutions. Compare the original image 
(2000 x 1500) and downsampled copies of the following resolutions: 1000 x 750, 
500 x 375, 200 x 150, and 100 x 75. Plot the error versus compression ratio for 
each image resolution on the same plot. Explain the observed trends. 


Exercise 3-2. This example will explore geometry and sampling probabilities in 
high-dimensional spaces. Consider a two-dimensional square dart board with 
length L = 2 on both sides and a circle of radius R = 1 in the middle. Write a 
program to throw 10 000 darts by generating a uniform random x and y position
on the square. Compute the radius for each point and compute what fraction 
land inside the circle (i.e., how many have radius < 1). Is this consistent with 
your expectation based on the area of the circle and the square? 


Repeat this experiment, throwing 10000 darts randomly (sampled from a uni- 
form distribution) on an N-dimensional cube (length L = 2) with an N-dimensional 
sphere inside (radius R = 1), for N = 2 through N = 10. For a given N, what
fraction of the points land inside the sphere? Plot this fraction versus N. Also
compute the histogram of the radii of the randomly sampled points for each N 
and plot these. What trends do you notice in the data? 


Exercise 3-3. This exercise will explore the relationship between the sparsity K, 
the signal size n, and the number of samples p in compressed sensing. 


(a) For n = 1000 and K = 5, create a K-sparse vector s of Fourier coefficients
in a Fourier basis $\Psi$. For each p from 1 to 100, create a Gaussian random
sampling matrix $C \in \mathbb{R}^{p \times n}$ to create a measurement vector
$y = C \Psi s$. Use compressed sensing based on this measurement to estimate s.
For each p, repeat this with at least 10 realizations of the random measurement
matrix C. Plot the average relative error of $\|\hat{s} - s\|_2 / \|s\|_2$
versus p; it may be helpful to visualize the errors with a box-and-whisker
plot. Explain the trends. Also plot the average $\ell_1$ and $\ell_0$ error
versus p.


(b) Repeat the above experiment for K = 1 through K = 20. What changes? 


(c) Now repeat the above experiment for K = 5, varying the signal size using 
n = 100, n = 500, n = 1000, n = 2000, and n = 5000. 


Exercise 3-4. Repeat the above exercise with a uniformly sampled random sam- 
ple matrix. Also repeat with a Bernoulli random matrix and a matrix that com- 
prises random single pixels. Plot the average relative errors for these different 
sampling matrices on the same plot (including the plot for Gaussian random
sampling). Discuss the trends. 


Exercise 3-5. Generate a DFT matrix $\Psi$ for n = 512. We will use this as a
basis for compressed sensing, and we will compute the incoherence of this basis
and different measurement matrices. For p = 16, create different random
measurement matrices C given by Gaussian random measurements, Bernoulli random
measurements, and random single-pixel measurements. For each matrix, normalize
the length of each row to 1. Now, for each measurement matrix type, compute the
incoherence $\mu(C, \Psi)$. Repeat this for many random instances of each C
matrix type and compare the histogram of incoherence values for each matrix
type. Further, compare the histogram of each inner product
$\sqrt{n} \langle c_k, \psi_j \rangle$ for each matrix type. Discuss any trends
and the implications for compressed sensing with these measurement matrices.
Are there other factors that are relevant for the sensing matrix?


Exercise 3-6. This exercise will explore sparse representation from Section 3.6
to estimate a fluid flow field, following Callaham et al. [147]. Load the cylin- 
der flow data set. Coarsen each flow field by a factor of 20 in each direction 
using imresize, and build a library of these coarsened measurements (i.e., a 
matrix, where each column contains these downsampled measurements). Plot 
a movie of the flow field in these new coordinates. Now, pick a column of the 
full flow field matrix and add Gaussian random noise to this field. Downsam- 
ple the noisy field by a factor of 20 and use SRC to find the closest downsam- 
pled library element. Then use this column of the full flow field library as your 
reconstructed estimate. 


Try this approach with different levels of noise added to the original flow field. 
See how much noise is required before the method breaks. Try different ap- 
proaches to creating a low-dimensional representation of the image (i.e., in- 
stead of downsampling, you can measure the flow field in a small 5 x 10 patch 
and use this as the low-dimensional feature for SRC). 


Exercise 3-7. This exercise will explore RPCA from Section 3.7 for robust flow
field analysis, following Scherl et al. [631]. 


(a) Load the cylinder flow data set. Compute the SVD as in Exercise 1-7 and 
plot the movie of the flow. Also plot the singular values and first 10 sin- 
gular vectors. 


(b) Now, contaminate a random 10% of the entries of the data matrix with
salt-and-pepper noise. The contaminated points should not be the same
for each column, and the salt-and-pepper noise should be $\pm 5\eta$, where
$\eta$ is the standard deviation of the entire data set. Compute the SVD of
the contaminated matrix and plot the movie of the flow along with the
singular values and first 10 singular vectors.


(c) Clean the contaminated data set by applying RPCA and keeping the
low-rank portion L. Again, compute the SVD of the decontaminated matrix L
and plot the movie of the flow along with the singular values and first 10
singular vectors. Compare these with the results from the original clean
and contaminated data sets.


(d) Try to clean the data by applying the Gavish-Donoho threshold to the
data matrix contaminated with salt-and-pepper noise. Does this work?
Explain why or why not.


Exercise 3-8. This exercise will explore the sparse sensor selection approach
based on QR from Section 3.8.


(a) Load the Yale B faces data set. Randomly choose one person to omit from
the data matrix and compute the SVD of the remaining data. Compute the
QR sensor locations for p = 100 using the first r = 100 modes of this SVD
basis U. Use these sensor locations to reconstruct the images of the person
that was left out of the matrix for the SVD. Compare the reconstruction
error using these QR sensor locations with reconstruction using p = 100
randomly chosen sensors, as in Fig. 3.22.


(b) Now, repeat this experiment 36 times, each time omitting a different
person from the data before computing the SVD, and use the sensor locations
to reconstruct the images of the omitted person. This will provide enough
reconstruction errors on which to perform statistics. For each experiment,
also compute the reconstruction error using 36 different configurations of
p = 100 random sensors. Plot the histograms of the error for the QR and
random sensors, and discuss.


(c) Finally, repeat the above experiments for different sensor number p = 10
through p = 200 in increments of 10. Plot the error distributions versus
p for QR and random sensor configurations. Because each value of p
corresponds to many reconstruction errors, it would be best to plot this as a
box-and-whisker plot or as a violin plot.


Exercise 3-9. In the exercise above, for p = 100, compare the reconstruction 
results using the p = 100 QR sensors to reconstruct in the first r = 100 modes, 
versus using the same sensors to reconstruct in the first r = 90 SVD modes. 
Is one more accurate than the other? Compare the condition number of the 
100 x 100 and 100 x 90 matrices obtained by sampling the p rows of the r = 100 
and r = 90 columns of U from the SVD. 
Chapter 4


Regression and Model Selection 


All of machine learning revolves around optimization. This includes regres- 
sion and model selection frameworks that aim to provide parsimonious and 
interpretable models for data [350]. Curve fitting is the most basic of regression 
techniques, with polynomial and exponential fitting resulting in solutions that 
come from solving the linear system 


Ax =b. (4.1) 


When the model is not prescribed, then optimization methods are used to select
the best model. This changes the underlying mathematics for function fitting to
either an over-determined or an under-determined optimization problem for
linear systems given by

$$\operatorname*{argmin}_{x} \left( \|Ax - b\|_2 + \lambda g(x) \right) \quad \text{or} \tag{4.2a}$$

$$\operatorname*{argmin}_{x} \; g(x) \quad \text{subject to} \quad \|Ax - b\|_2 \le \epsilon, \tag{4.2b}$$

where g(x) is a given penalization (with penalty parameter $\lambda$ for
over-determined systems).

For over- and under-determined linear systems of equations, which result
in either no solutions or an infinite number of solutions of (4.1), a choice of
constraint or penalty, which is also known as regularization, must be made in
order to produce a solution. For instance, one can enforce the solution with
the smallest $\ell_2$-norm in an under-determined system so that
$\min g(x) = \min \|x\|_2$. More generally, when considering regression to
nonlinear models, the overall mathematical framework takes the more general
form

$$\operatorname*{argmin}_{x} \left( f(A,x,b) + \lambda g(x) \right) \quad \text{or} \tag{4.3a}$$

$$\operatorname*{argmin}_{x} \; g(x) \quad \text{subject to} \quad f(A,x,b) \le \epsilon, \tag{4.3b}$$


which are often solved using gradient descent algorithms. Indeed, this general 
framework is also at the center of deep learning algorithms. 
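For the under-determined case, the minimum $\ell_2$-norm choice mentioned above has a closed form through the pseudo-inverse. The sketch below, with assumed matrix sizes and random data, compares that solution with another exact solution obtained by adding a null-space component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Under-determined system (sizes assumed): 5 equations, 20 unknowns
A = rng.standard_normal((5, 20))
b = rng.standard_normal(5)

# Minimum l2-norm solution via the pseudo-inverse
A_pinv = np.linalg.pinv(A)
x_min = A_pinv @ b

# Another exact solution: add a vector from the null space of A
null_component = (np.eye(20) - A_pinv @ A) @ rng.standard_normal(20)
x_other = x_min + null_component

print("residuals:", np.linalg.norm(A @ x_min - b), np.linalg.norm(A @ x_other - b))
print("norms:    ", np.linalg.norm(x_min), np.linalg.norm(x_other))
```

Both vectors satisfy Ax = b, so the data alone cannot distinguish them; the regularizer g(x) is what selects one of the infinitely many solutions.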
Figure 4.1: Prototypical behavior of over- and underfitting of data. (a) For
overfitting, increasing the model complexity or training epochs (iterations)
leads to improved reduction of error on training data while increasing the error
on the withheld data. (b) For underfitting, the error performance is limited due
to restrictions on model complexity. These canonical graphs are ubiquitous in
data science and of paramount importance when evaluating a model.


In addition to optimization strategies, a central concern in data science is
understanding if a proposed model has overfit or underfit the data. Thus
cross-validation strategies are critical for evaluating any proposed model.
Cross-validation will be discussed in detail in what follows, but the main
concepts can be understood from Fig. 4.1. A given data set must be partitioned
into training, validation, and withhold sets. A model is constructed from the
training and validation data and finally tested on the withhold set. For
overfitting, increasing the model complexity or training epochs (iterations)
improves the error on the training set while leading to increased error on the
withhold set. Figure 4.1(a) shows the canonical behavior of data overfitting,
suggesting that the model complexity and/or training epochs be limited in order
to avoid the overfitting. In contrast, underfitting limits the ability to
achieve a good model as shown in Fig. 4.1(b). However, it is not always clear
if you are underfitting or if the model can be improved. Cross-validation is of
such paramount importance that it is automatically included in most machine
learning algorithms in MATLAB. Importantly, the following mantra holds: If you
do not cross-validate, you is dumb.

The next few chapters will outline how optimization and cross-validation 
arise in practice, and will highlight the choices that need to be made in apply- 
ing meaningful constraints and structure to g(x) so as to achieve interpretable 
solutions. Indeed, the objective (loss) function f(-) and regularization g(-) are 
equally important in determining computationally tractable optimization strate- 
gies. Often times, proxy loss and regularization functions are chosen in order to 
achieve approximations to the true objective of the optimization. Such choices 
depend strongly upon the application area and data under consideration. 
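The train versus withhold behavior sketched in Fig. 4.1 can be reproduced with a polynomial fit of increasing degree. The example below is a minimal illustration under assumed data (a noisy cubic), an assumed 40/20 split, and degrees 0 through 9; it uses a single withheld set rather than a full k-fold procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data: a cubic trend plus noise
x = np.linspace(-1, 1, 60)
y = x**3 - 0.5 * x + 0.1 * rng.standard_normal(x.size)

# Random partition into training and withheld points
perm = rng.permutation(x.size)
train, withhold = perm[:40], perm[40:]

degrees = range(0, 10)
train_err, withhold_err = [], []
for d in degrees:
    V_train = np.vander(x[train], d + 1)           # columns x^d, ..., x, 1
    coeffs, *_ = np.linalg.lstsq(V_train, y[train], rcond=None)
    V_with = np.vander(x[withhold], d + 1)
    train_err.append(np.mean((V_train @ coeffs - y[train]) ** 2))
    withhold_err.append(np.mean((V_with @ coeffs - y[withhold]) ** 2))

best_degree = int(np.argmin(withhold_err))
print("training error by degree:", np.round(train_err, 4))
print("withheld error by degree:", np.round(withhold_err, 4))
print("degree with smallest withheld error:", best_degree)
```

The training error can only decrease as the degree grows because the models are nested, whereas the withheld error is the quantity that indicates when added complexity stops helping.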
4.1 Classic Curve Fitting


Curve fitting is one of the most basic and foundational tools in data science. 
From our earliest educational experiences in the engineering and physical sci- 
ences, least-squares polynomial fitting was advocated for understanding the 
dominant trends in real data. Adrien-Marie Legendre used least-squares as 
early as 1805 to fit astronomical data [436], with Gauss more fully developing 
the theory of least-squares as an optimization problem in a seminal
contribution of 1821 [264]. Curve fitting in such astronomical applications was
highly effective given the simple elliptical orbits (quadratic polynomial
functions) manifested by planets and comets. Thus one can argue that data
science has long been
a cornerstone of our scientific efforts. Indeed, it was through Kepler’s access to 
Tycho Brahe’s state-of-the-art astronomical data that he was able, after 11 years 
of research, to produce the foundations for the laws of planetary motion, posit- 
ing the elliptical nature of planetary orbits, which were clearly best-fit solutions 
to the available data [380]. 

A broader mathematical viewpoint of curve fitting, which we will advocate
throughout this text, is regression. Like curve fitting, regression attempts to
estimate the relationship among variables using a variety of statistical tools.
Specifically, one can consider the general relationship between independent
variables $\mathbf{X}$, dependent variables $\mathbf{Y}$, and some unknown
parameters $\boldsymbol{\beta}$:


$$ \mathbf{Y} = f(\mathbf{X}, \boldsymbol{\beta}), \tag{4.4} $$


where the regression function $f(\cdot)$ is typically prescribed and the
parameters $\boldsymbol{\beta}$ are found by optimizing the goodness-of-fit of
this function to data. In what follows, we will consider curve fitting as a
special case of regression. Importantly, regression and curve fitting discover
relationships among variables by optimization. Broadly speaking, machine
learning is framed around regression techniques, which are themselves framed
around optimization based on data. Thus, at their absolute mathematical core,
machine learning and data science revolve around positing an optimization
problem. Of course, the success of optimization itself depends critically on
defining an objective function to be optimized.


Least-Squares Fitting Methods 


To illustrate the concepts of regression, we will consider classic least-squares
polynomial fitting for characterizing trends in data. The concept is
straightforward: use a simple function to describe a trend by minimizing
the sum-square error between the selected function $f(\cdot)$ and its fit to
the data. As we show here, classical curve fitting is formulated as a simple
solution of Ax = b.


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




Consider a set of n data points: 


$$ (x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n). \tag{4.5} $$


Further, assume that we would like to find a best-fit line through these points.
We can approximate the line by the function


$$ f(x) = \beta_1 x + \beta_2, \tag{4.6} $$


where the constants $\beta_1$ and $\beta_2$, which are the parameters of the
vector $\boldsymbol{\beta}$ in (4.4), are chosen to minimize some error
associated with the fit. The line fit gives the linear regression model
$\mathbf{Y} = f(\mathbf{X}, \boldsymbol{\beta}) = \beta_1 \mathbf{X} + \beta_2$.
Thus the function gives a linear model which approximates the data, with the
approximation error at each point given by

$$ f(x_k) = y_k + E_k, \tag{4.7} $$


where $y_k$ is the true value of the data and $E_k$ is the error of the fit
from this value.

Various error metrics can be minimized when approximating with a given
function $f(x)$. The choice of error metric, or norm, used to compute a
goodness-of-fit will be critical in this chapter. Three standard possibilities
are often considered, which are associated with the $\ell_2$- (least-squares),
$\ell_1$-, and $\ell_\infty$-norms. These are defined as follows:


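$$ E_\infty(f) = \max_{1 \le k \le n} |f(x_k) - y_k| \qquad \text{maximum error } (\ell_\infty), \tag{4.8a} $$
$$ E_1(f) = \frac{1}{n} \sum_{k=1}^{n} |f(x_k) - y_k| \qquad \text{mean absolute error } (\ell_1), \tag{4.8b} $$
$$ E_2(f) = \left( \frac{1}{n} \sum_{k=1}^{n} |f(x_k) - y_k|^2 \right)^{1/2} \qquad \text{least-squares error } (\ell_2). \tag{4.8c} $$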


Such regression error metrics have been considered in an earlier chapter,
but they will be considered once again here in the framework of model
selection.

In addition to the above norms, one can more broadly consider the error
based on the $\ell_p$-norm:


$$ E_p(f) = \left( \frac{1}{n} \sum_{k=1}^{n} |f(x_k) - y_k|^p \right)^{1/p}. \tag{4.9} $$


For different values of p, the best-fit line will be different. In most cases, the 
differences are small. However, when there are outliers in the data, the choice 
of norm can have a significant impact. 
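
To make the sensitivity to outliers concrete, the three metrics in (4.8) can
be evaluated for a fixed candidate line. The sketch below is illustrative
only: the helper name fit_errors and the four data points (the last of which
plays the role of an outlier) are assumptions, not data from the text.

```python
import numpy as np

def fit_errors(beta, x, y):
    """Return (E_inf, E_1, E_2) from (4.8) for the line f(x) = beta[0]*x + beta[1]."""
    r = np.abs(beta[0] * x + beta[1] - y)   # pointwise absolute residuals
    e_inf = r.max()                         # maximum error (4.8a)
    e_1 = r.mean()                          # mean absolute error (4.8b)
    e_2 = np.sqrt(np.mean(r**2))            # least-squares (RMS) error (4.8c)
    return e_inf, e_1, e_2

# Hypothetical data: three points on the line y = x plus one outlier.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 7.0])
e_inf, e_1, e_2 = fit_errors((1.0, 0.0), x, y)
```

For the line y = x, the single outlier contributes a residual of 4, giving
E_inf = 4, E_2 = 2, and E_1 = 1: the larger the exponent p, the more heavily
the outlier is weighted.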
Figure 4.2: Line fits for the three different error metrics $E_\infty$, $E_1$,
and $E_2$. In (a), the data has no outliers and the three linear models,
although different, produce approximately the same model. With outliers, (b)
shows that the predictions are significantly different.


When fitting a curve to a set of data, the root-mean-square error (4.8c) is
often chosen to be minimized. This is called a least-squares fit. Figure 4.2
depicts three line fits that minimize the errors $E_\infty$, $E_1$, and $E_2$
listed above. The $E_\infty$ error line fit is strongly influenced by the one
data point that does not fit the trend. The $E_1$ and $E_2$ lines fit nicely
through the bulk of the data, although their slopes are quite different in
comparison to when the data has no outliers. The linear models for these three
error metrics are constructed using MATLAB’s fminsearch command. The code for
all three is given as follows.


Code 4.1: [MATLAB] Regression for linear fit.


% x, y: vectors of data, assumed to be defined beforehand
p1=fminsearch(@(p) fit1(p,x,y),[1 1]); % l-infinity fit
p2=fminsearch(@(p) fit2(p,x,y),[1 1]); % l1 fit
p3=fminsearch(@(p) fit3(p,x,y),[1 1]); % l2 fit


xf=0:0.1:11;
y1=polyval(p1,xf); y2=polyval(p2,xf); y3=polyval(p3,xf);


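Code 4.1: [Python] Regression for linear fit.


import numpy as np
import scipy.optimize

# x, y: NumPy arrays of data, assumed to be defined beforehand
t = (x, y)
x0 = np.array([1, 1])  # initial guess for (slope, intercept)
p1 = scipy.optimize.fmin(fit1, x0, args=(t,))  # l-infinity fit
p2 = scipy.optimize.fmin(fit2, x0, args=(t,))  # l1 fit
p3 = scipy.optimize.fmin(fit3, x0, args=(t,))  # l2 fit


xf = np.arange(0, 11, 0.1)
y1 = np.polyval(p1, xf); y2 = np.polyval(p2, xf); y3 = np.polyval(p3, xf)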


For each error metric, the corresponding error must be computed. The
fminsearch command requires that the objective function for minimization be
given. For the three error metrics considered, this results in the
following set of functions for fminsearch.


Code 4.2: [MATLAB] Maximum error l-infinity.


function E=fit1(x0,x,y)
E=max(abs( x0(1)*x+x0(2)-y ));


Code 4.2: [MATLAB] Sum of absolute error l1.


function E=fit2(x0,x,y)
E=sum(abs( x0(1)*x+x0(2)-y ));


Code 4.2: [MATLAB] Least-squares error l2.


function E=fit3(x0,x,y)
E=sum(abs( x0(1)*x+x0(2)-y ).^2 );


Code 4.2: [Python] Fitting errors.


def fit1(x0,t):
    x,y = t
    return np.max(np.abs(x0[0]*x + x0[1]-y))

def fit2(x0,t):
    x,y = t
    return np.sum(np.abs(x0[0]*x + x0[1]-y))

def fit3(x0,t):
    x,y = t
    return np.sum(np.power(np.abs(x0[0]*x + x0[1]-y),2))


Finally, for the outlier data, an additional point is added to the data in order to 
help illustrate the influence of the error metrics on producing a linear regression 
model. 


Least-Squares Line 


Least-squares fitting to linear models has critical advantages over other norms
and metrics. Specifically, the optimization is inexpensive, since the minimizer
can be computed analytically. To show this explicitly, consider applying the
least-squares fit criterion to the data points $(x_k, y_k)$, where
$k = 1, 2, 3, \ldots, n$. To fit the curve
$$ f(x) = \beta_1 x + \beta_2 \tag{4.10} $$
to this data, the error $E_2$ is found by minimizing the sum:
$$ E_2(f) = \sum_{k=1}^{n} |f(x_k) - y_k|^2 = \sum_{k=1}^{n} (\beta_1 x_k + \beta_2 - y_k)^2. \tag{4.11} $$


Minimizing this sum requires differentiation. Specifically, the constants
$\beta_1$ and $\beta_2$ are chosen so that a minimum occurs. Thus we require
$\partial E_2/\partial \beta_1 = 0$ and $\partial E_2/\partial \beta_2 = 0$.
Note that although a zero derivative can indicate either a minimum or a
maximum, we know this must be a minimum of the error since there is no maximum
error, i.e., we can always choose a line that has a larger error.
The minimization condition gives


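$$ \frac{\partial E_2}{\partial \beta_1} = 0: \quad \sum_{k=1}^{n} 2(\beta_1 x_k + \beta_2 - y_k)\, x_k = 0, \tag{4.12a} $$
$$ \frac{\partial E_2}{\partial \beta_2} = 0: \quad \sum_{k=1}^{n} 2(\beta_1 x_k + \beta_2 - y_k) = 0. \tag{4.12b} $$


Upon rearranging, a 2 x 2 system of linear equations is found for $\beta_1$ and $\beta_2$:


$$ \begin{pmatrix} \sum_{k=1}^{n} x_k^2 & \sum_{k=1}^{n} x_k \\ \sum_{k=1}^{n} x_k & n \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum_{k=1}^{n} x_k y_k \\ \sum_{k=1}^{n} y_k \end{pmatrix} \quad \Longrightarrow \quad \mathbf{A}\mathbf{x} = \mathbf{b}. \tag{4.13} $$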


This linear system of equations can be solved using the backslash command in 
MATLAB. Thus an optimization procedure is unnecessary since the solution is 
computed exactly from a 2 x 2 matrix. 
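
As a hedged illustration of (4.13), the 2 x 2 system can be assembled and
solved directly. The function name least_squares_line and the sample data
below are assumptions made for the example; np.linalg.solve plays the role of
the MATLAB backslash command here.

```python
import numpy as np

def least_squares_line(x, y):
    """Solve the 2x2 normal equations (4.13) for the slope and intercept."""
    n = len(x)
    A = np.array([[np.sum(x**2), np.sum(x)],
                  [np.sum(x),    n]])
    b = np.array([np.sum(x * y), np.sum(y)])
    return np.linalg.solve(A, b)      # returns (beta_1, beta_2)

# Hypothetical data roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta = least_squares_line(x, y)
```

The result should agree with a library line fit such as np.polyfit(x, y, 1),
which minimizes the same sum of squares.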

This method can be easily generalized to higher polynomial fits. In
particular, a parabolic fit to a set of data requires the fitting function


$$ f(x) = \beta_1 x^2 + \beta_2 x + \beta_3, \tag{4.14} $$


where now the three constants $\beta_1$, $\beta_2$, and $\beta_3$ must be
found. These can be solved for with the 3 x 3 system resulting from minimizing
the error $E_2(\beta_1, \beta_2, \beta_3)$ by taking


$$ \frac{\partial E_2}{\partial \beta_1} = 0, \tag{4.15a} $$
$$ \frac{\partial E_2}{\partial \beta_2} = 0, \tag{4.15b} $$
$$ \frac{\partial E_2}{\partial \beta_3} = 0. \tag{4.15c} $$


In fact, any polynomial fit of degree k will yield a (k + 1) x (k + 1) linear
system of equations Ax = b whose solution can be found.
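
A minimal sketch of the degree-k case follows, assuming the normal equations
are formed from a Vandermonde matrix. The helper name least_squares_poly and
the noise-free test data are illustrative assumptions.

```python
import numpy as np

def least_squares_poly(x, y, degree):
    """Form and solve the (degree+1) x (degree+1) normal equations."""
    V = np.vander(x, degree + 1)   # columns: x^degree, ..., x, 1
    A = V.T @ V                    # (degree+1) x (degree+1) matrix
    b = V.T @ y
    return np.linalg.solve(A, b)   # coefficients, highest power first

# Hypothetical noise-free data from f(x) = 2x^2 - x + 0.5.
x = np.linspace(-1, 1, 20)
y = 2 * x**2 - x + 0.5
beta = least_squares_poly(x, y, 2)
```

For high polynomial degrees the normal-equation matrix becomes poorly
conditioned, so in practice a routine such as np.polyfit or np.linalg.lstsq is
usually the safer choice.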
Data Linearization


Although a powerful method, the minimization procedure for general fitting of
arbitrary functions results in equations which are non-trivial to solve.
Specifically, consider fitting data to the exponential function


$$ f(x) = \beta_2 \exp(\beta_1 x). \tag{4.16} $$


The error to be minimized is


$$ E_2(\beta_1, \beta_2) = \sum_{k=1}^{n} (\beta_2 \exp(\beta_1 x_k) - y_k)^2. \tag{4.17} $$


Applying the minimizing conditions leads to


$$ \frac{\partial E_2}{\partial \beta_1} = 0: \quad \sum_{k=1}^{n} 2(\beta_2 \exp(\beta_1 x_k) - y_k)\, \beta_2 x_k \exp(\beta_1 x_k) = 0, \tag{4.18a} $$
$$ \frac{\partial E_2}{\partial \beta_2} = 0: \quad \sum_{k=1}^{n} 2(\beta_2 \exp(\beta_1 x_k) - y_k) \exp(\beta_1 x_k) = 0. \tag{4.18b} $$
This in turn leads to the 2 x 2 system
$$ \beta_2 \sum_{k=1}^{n} x_k \exp(2\beta_1 x_k) - \sum_{k=1}^{n} x_k y_k \exp(\beta_1 x_k) = 0, \tag{4.19a} $$
$$ \beta_2 \sum_{k=1}^{n} \exp(2\beta_1 x_k) - \sum_{k=1}^{n} y_k \exp(\beta_1 x_k) = 0. \tag{4.19b} $$


This system of equations is nonlinear and cannot be solved in a straightforward
fashion. Indeed, a solution may not even exist. Or many solutions may exist.
Section 4.2 describes a possible iterative procedure, called gradient descent,
for solving this nonlinear system of equations.

To avoid the difficulty of solving this nonlinear system, the exponential fit
can be linearized by the transformation


$$ Y = \ln(y), \tag{4.20a} $$
$$ X = x, \tag{4.20b} $$
$$ \beta_3 = \ln \beta_2. \tag{4.20c} $$
Then the fit function
$$ f(x) = y = \beta_2 \exp(\beta_1 x) \tag{4.21} $$


can be linearized by taking the natural log of both sides so that


$$ \ln y = \ln(\beta_2 \exp(\beta_1 x)) = \ln \beta_2 + \ln(\exp(\beta_1 x)) = \beta_3 + \beta_1 x \quad \Longrightarrow \quad Y = \beta_1 X + \beta_3. \tag{4.22} $$
By fitting to the natural log of the y data,
$$ (x_i, y_i) \rightarrow (x_i, \ln y_i) = (X_i, Y_i), \tag{4.23} $$


the curve fit for the exponential function becomes a linear fitting problem,
which is easily handled. Thus, if a transform exists that linearizes the data,
then standard polynomial fitting methods can be used to solve the resulting
linear system Ax = b.
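
The linearization (4.20)-(4.23) can be sketched as follows. The helper name
fit_exponential and the noise-free data are assumptions for illustration; the
approach requires y > 0 so that the logarithm is defined.

```python
import numpy as np

def fit_exponential(x, y):
    """Fit y = beta2 * exp(beta1 * x) via a line fit to (x, ln y)."""
    beta1, beta3 = np.polyfit(x, np.log(y), 1)   # Y = beta1 * X + beta3
    return beta1, np.exp(beta3)                  # beta2 = exp(beta3), see (4.20c)

# Hypothetical noise-free data from y = 3 exp(0.7 x).
x = np.linspace(0, 2, 25)
y = 3.0 * np.exp(0.7 * x)
beta1, beta2 = fit_exponential(x, y)
```

Note that this minimizes the squared error in ln y rather than the original
error (4.17), so with noisy data the two fits generally differ slightly.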


4.2 Nonlinear Regression and Gradient Descent 


Polynomial and exponential curve fitting admit analytically tractable, best-fit
least-squares solutions. However, such curve fits are highly specialized and a
more general mathematical framework is necessary for solving a broader set
of problems. For instance, one may wish to fit a nonlinear function of the form
$f(x) = \beta_1 \cos(\beta_2 x + \beta_3) + \beta_4$ to a data set. Instead of
solving a linear system of equations, general nonlinear curve fitting leads to
a system of nonlinear equations. The general theory of nonlinear regression
assumes that the fitting function takes the general form


$$ f(x) = f(x, \boldsymbol{\beta}), \tag{4.24} $$


where the m < n fitting coefficients $\boldsymbol{\beta} \in \mathbb{R}^m$ are
used to minimize the error. The least-squares error is then defined as


$$ E_2(\boldsymbol{\beta}) = \sum_{k=1}^{n} (f(x_k, \boldsymbol{\beta}) - y_k)^2, \tag{4.25} $$


which can be minimized by considering the m x m system generated from
minimizing with respect to each parameter $\beta_j$:


$$ \frac{\partial E_2}{\partial \beta_j} = 0 \quad \text{for } j = 1, 2, \ldots, m. \tag{4.26} $$
In general, this gives the nonlinear set of equations
$$ \sum_{k=1}^{n} (f(x_k, \boldsymbol{\beta}) - y_k) \frac{\partial f}{\partial \beta_j} = 0 \quad \text{for } j = 1, 2, 3, \ldots, m. \tag{4.27} $$


There are no general methods available for solving such nonlinear systems. In- 
deed, nonlinear systems can have no solutions, several solutions, or even an 
infinite number of solutions. Most attempts at solving nonlinear systems are 
based on iterative schemes which require a good initial guess to converge to 
the global minimum error. Regardless, the general fitting procedure is straight- 
forward and allows for the construction of a best-fit curve to match the data. 
Figure 4.3: Two objective function landscapes representing (a) a convex
function and (b) a non-convex function. Convex functions have many guarantees
of convergence, while non-convex functions have a variety of pitfalls that can
limit the success of gradient descent. For non-convex functions, local minima
and an inability to compute gradient directions (derivatives that are near zero)
make optimization challenging.


In such a solution procedure, it is imperative that a reasonable initial guess be 
provided by the user. Otherwise, rapid convergence to the desired root may not 
be achieved. 
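
As a sketch of such an iterative procedure, the cosine model mentioned above
can be fit with a general-purpose nonlinear least-squares routine. This uses
scipy.optimize.least_squares rather than any method prescribed by the text,
and the true parameters, the sampling interval, and the initial guess are all
assumptions chosen so that the guess lies close to the solution.

```python
import numpy as np
from scipy.optimize import least_squares

def model(beta, x):
    """f(x, beta) = beta_1 cos(beta_2 x + beta_3) + beta_4."""
    return beta[0] * np.cos(beta[1] * x + beta[2]) + beta[3]

def residuals(beta, x, y):
    """Residual vector whose sum of squares is E_2 in (4.25)."""
    return model(beta, x) - y

# Hypothetical noise-free data generated from known parameters.
beta_true = np.array([2.0, 1.5, 0.5, 1.0])
x = np.linspace(0, 5, 100)
y = model(beta_true, x)

# A reasonable initial guess is essential; a poor one may converge
# to a different local minimum of the error.
beta_guess = np.array([1.8, 1.45, 0.45, 0.9])
result = least_squares(residuals, beta_guess, args=(x, y))
beta_fit = result.x
```

Starting the same routine from a guess far from beta_true (for example, a
frequency that is off by a factor of two) is a simple way to observe the
sensitivity to initialization described above.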

Figure 4.3 shows two example functions to be minimized. The first is a
convex function (Fig. 4.3(a)). Convex functions are ideal in that guarantees of
convergence exist for many algorithms, and gradient descent can be tuned to
perform exceptionally well for such functions. The second illustrates a
non-convex function (Fig. 4.3(b)) and shows many of the typical problems
associated with gradient descent, including the fact that the function has
multiple local minima as well as flat regions where gradients are difficult to
actually compute, i.e., the gradient is near zero. Optimizing this non-convex
function requires a good guess for the initial conditions of the gradient
descent algorithm, although there are many advances around gradient descent for
restarting and ensuring that one is not stuck in a local minimum. Recent
training algorithms for deep neural networks have greatly advanced gradient
descent innovations. This will be further considered in Chapter 6 on neural
networks.


Gradient Descent 


For high-dimensional systems, we generalize the concept of a minimum or
maximum, i.e., an extremum of a multi-dimensional function $f(\mathbf{x})$. At
an extremum, the gradient must be zero, so that


$$ \nabla f(\mathbf{x}) = \mathbf{0}. \tag{4.28} $$

Since saddles exist in higher-dimensional spaces, one must test if the extremum
point is a minimum or a maximum. The idea behind gradient descent, or steepest
descent, is to use the derivative information as the basis of an iterative
algorithm that progressively converges to a local minimum point of
$f(\mathbf{x})$.

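To illustrate how to proceed in practice, consider the simple two-dimensional
surface


$$ f(x, y) = x^2 + 3y^2, \tag{4.29} $$


which has a single minimum located at the origin $(x, y) = (0, 0)$. The
gradient for this function is
$$ \nabla f(\mathbf{x}) = \frac{\partial f}{\partial x}\hat{\mathbf{x}} + \frac{\partial f}{\partial y}\hat{\mathbf{y}} = 2x\,\hat{\mathbf{x}} + 6y\,\hat{\mathbf{y}}, \tag{4.30} $$
where $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are unit vectors in the x and y
directions, respectively.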

Figure 4.4 illustrates the steepest descent algorithm. At the initial
guess point, the gradient $\nabla f(\mathbf{x})$ is computed. This gives the
direction of steepest descent, i.e., $f(\mathbf{x})$ decreases most rapidly
along $-\nabla f(\mathbf{x})$. Note that the gradient does not, in general,
point at the minimum, but rather gives the locally steepest path for minimizing
$f(\mathbf{x})$. The geometry of the steepest descent suggests the construction
of an algorithm whereby the next point in the iteration is picked by following
the steepest descent, so that


$$ \mathbf{x}_{k+1}(\delta) = \mathbf{x}_k - \delta \nabla f(\mathbf{x}_k), \tag{4.31} $$


where the parameter $\delta$ dictates how far to move along the gradient descent
curve. This formula represents a generalization of a Newton method where the
derivative is used to compute an update in the iteration scheme. In gradient
descent, it is crucial to determine how much to step forward according to the
computed gradient, so that the algorithm is always going downhill in an optimal
way. This requires the determination of the correct value of $\delta$ in the
algorithm.

To compute the value of $\delta$, consider the construction of a new function


$$ F(\delta) = f(\mathbf{x}_{k+1}(\delta)), \tag{4.32} $$


which must be minimized now as a function of $\delta$. This is accomplished by
computing $\partial F/\partial \delta = 0$. Thus one finds


$$ \frac{\partial F}{\partial \delta} = -\nabla f(\mathbf{x}_{k+1}) \cdot \nabla f(\mathbf{x}_k) = 0. \tag{4.33} $$
The geometrical interpretation of this result is the following:
$\nabla f(\mathbf{x}_k)$ is the gradient direction of the current iteration
point and $\nabla f(\mathbf{x}_{k+1})$ is the gradient direction of the future
point; thus $\delta$ is chosen so that the two gradient directions are
orthogonal.
Figure 4.4: Gradient descent algorithm applied to the function
$f(x, y) = x^2 + 3y^2$. The contours are plotted for each successive value
(x, y) in the iteration algorithm given the initial guess (x, y) = (3, 2). Note
the orthogonality of each successive gradient in the steepest descent algorithm.


For the example given above with $f(x, y) = x^2 + 3y^2$, we can compute this
condition as follows:
$$ \mathbf{x}_{k+1} = \mathbf{x}_k - \delta \nabla f(\mathbf{x}_k) = (1 - 2\delta)x\,\hat{\mathbf{x}} + (1 - 6\delta)y\,\hat{\mathbf{y}}. \tag{4.34} $$
This expression is used to compute
$$ F(\delta) = f(\mathbf{x}_{k+1}(\delta)) = (1 - 2\delta)^2 x^2 + 3(1 - 6\delta)^2 y^2, \tag{4.35} $$
whereby its derivative with respect to $\delta$ gives
$$ F'(\delta) = -4(1 - 2\delta)x^2 - 36(1 - 6\delta)y^2. \tag{4.36} $$
Setting $F'(\delta) = 0$ then gives


$$ \delta = \frac{x^2 + 9y^2}{2x^2 + 54y^2} \tag{4.37} $$
as the optimal descent step length. Note that the length of $\delta$ is updated
as the algorithm progresses. This gives us all the information necessary to
perform the steepest descent search for the minimum of the given function.
Code 4.3: [MATLAB] Gradient descent example.


x(1)=3; y(1)=2; % initial guess
f(1)=x(1)^2+3*y(1)^2; % initial function value
for j=1:10
    del=(x(j)^2+9*y(j)^2)/(2*x(j)^2+54*y(j)^2);
    x(j+1)=(1-2*del)*x(j); % update values
    y(j+1)=(1-6*del)*y(j);
    f(j+1)=x(j+1)^2+3*y(j+1)^2;
    if abs(f(j+1)-f(j))<10^(-6) % check convergence
        break
    end
end


Code 4.3: [Python] Gradient descent example.

import numpy as np

# Storage for the iterates (at most 10 steps, matching the MATLAB listing)
x = np.zeros(11); y = np.zeros(11); f = np.zeros(11)

x[0] = 3; y[0] = 2  # Initial guess
f[0] = x[0]**2 + 3*y[0]**2  # Initial function value

for j in range(len(x)-1):
    Del = (x[j]**2 + 9*y[j]**2)/(2*x[j]**2 + 54*y[j]**2)
    x[j+1] = (1 - 2*Del)*x[j]  # update values
    y[j+1] = (1 - 6*Del)*y[j]
    f[j+1] = x[j+1]**2 + 3*y[j+1]**2
    if np.abs(f[j+1]-f[j]) < 10**(-6):  # check convergence
        x = x[:j+2]
        y = y[:j+2]
        f = f[:j+2]
        break


As is clearly evident, this descent search algorithm based on derivative
information is similar to Newton’s method for root finding both in one dimension
as well as in higher dimensions. Figure 4.4 shows the rapid convergence to the
minimum for this convex function. Moreover, the gradient descent algorithm is
the core algorithm of advanced iterative solvers such as the bi-conjugate
gradient stabilized method (bicgstab) and the generalized minimal residual
method (gmres) [295].
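
The orthogonality condition (4.33) can be checked numerically for a single
step of the example above. The helper names grad_f and optimal_step are
assumptions introduced for this sketch; the formulas are (4.30) and (4.37).

```python
import numpy as np

def grad_f(p):
    """Gradient (4.30) of f(x, y) = x^2 + 3 y^2."""
    return np.array([2.0 * p[0], 6.0 * p[1]])

def optimal_step(p):
    """Optimal step length delta from (4.37)."""
    x, y = p
    return (x**2 + 9 * y**2) / (2 * x**2 + 54 * y**2)

p0 = np.array([3.0, 2.0])            # initial guess used in Code 4.3
delta = optimal_step(p0)
p1 = p0 - delta * grad_f(p0)         # one steepest descent step (4.31)
dot = grad_f(p1) @ grad_f(p0)        # should vanish by (4.33)
```

With the optimal step, the inner product of the two successive gradients is
zero up to rounding error, which is the right-angle zigzag seen in Fig. 4.4.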

In the example above, the gradient could be computed analytically. More
generally, given just data itself, the gradient can be computed with numerical
algorithms. The gradient command can be used to compute local or global
gradients. Figure 4.5 shows the gradient terms $\partial f/\partial x$ and
$\partial f/\partial y$ for the two functions shown in Fig. 4.3, where the
function $f(x, y)$ is a two-dimensional function computed from a known function
or directly from data. The outputs are matrices containing the values of
$\partial f/\partial x$ and $\partial f/\partial y$ over the discretized
domain. The gradient can be used to approximate either local or global
gradients to execute the gradient descent, as shown in Fig. 4.6.


Figure 4.5: Computation of the gradient for the two functions illustrated in
Fig. 4.3. In the left panels, the gradient terms (a) $\partial f/\partial x$
and (c) $\partial f/\partial y$ are computed for Fig. 4.3(a), while the right
panels compute these same terms for Fig. 4.3(b) in panels (b) and (d),
respectively. The gradient command numerically generates the gradient.
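
In Python, a comparable numerical gradient is available through np.gradient.
The sketch below applies it to the quadratic of (4.29) rather than to the
functions of Fig. 4.3, whose definitions are not given here; the grid limits
and resolution are assumptions.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = np.linspace(-3, 3, 61)
h = x[1] - x[0]                      # uniform grid spacing
X, Y = np.meshgrid(x, y)             # rows vary in y, columns vary in x
F = X**2 + 3 * Y**2

# np.gradient returns derivatives in axis order: axis 0 (y) first, then axis 1 (x).
dFdy, dFdx = np.gradient(F, h, h)
```

Central differences are used in the interior, so for this quadratic the
interior values match the analytic gradient (2x, 6y); the boundary values use
one-sided differences and are less accurate.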

The above discussion provides a rudimentary introduction to gradient
descent. A wide range of innovations have attempted to speed up this dominant
nonlinear optimization procedure, including alternating descent methods.
Some of these will be discussed further in the neural network chapter where
gradient descent plays a critical role in training a network. For now, one can
see that there are a number of issues for this nonlinear optimization procedure,
including determining the initial guess and step size $\delta$, and computing
the gradient efficiently.


Alternating Descent 


Another common technique for optimizing nonlinear functions of several
variables is the alternating descent method (ADM). Instead of computing the
gradient in several variables, optimization is done iteratively in one variable
at a time, as shown in Fig. 4.7. For the example just demonstrated, this would
make the computation of the gradient unnecessary. The basic strategy is simple:
optimize along one variable at a time, seeking the minimum while holding all
other variables fixed. After passing through each variable once, the process is
repeated until a desired convergence is reached. Note that the alternating
descent only requires a line search along one variable at a time, thus
potentially speeding up computations. Moreover, the method is derivative-free,
which is attractive in many applications.


Figure 4.6: Gradient descent applied to the function featured in Fig. 4.3(b).
Three initial conditions are shown: $(x_0, y_0) = \{(0, 4), (-5, 0), (2, -5)\}$.
The first of these (red circles) gets stuck in a local minimum, while the other
two initial conditions (blue and magenta) find the global minimum.
Interpolation of the gradient functions of Fig. 4.5 is used to update the
solutions.
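
The strategy just described can be sketched with a one-dimensional,
derivative-free line search from SciPy. The objective below is an illustrative
convex function with coupled variables, not the function of Fig. 4.3(b), and
the name alternating_descent, the sweep limit, and the tolerance are
assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(x, y):
    """Illustrative convex objective; the xy term couples the two variables."""
    return x**2 + 3 * y**2 + x * y

def alternating_descent(f, x0, y0, sweeps=50, tol=1e-10):
    """Minimize f(x, y) one variable at a time, holding the other fixed."""
    x, y = x0, y0
    f_old = f(x, y)
    for _ in range(sweeps):
        x = minimize_scalar(lambda s: f(s, y)).x   # line search in x only
        y = minimize_scalar(lambda s: f(x, s)).x   # line search in y only
        f_new = f(x, y)
        if abs(f_old - f_new) < tol:               # check convergence
            break
        f_old = f_new
    return x, y

x_min, y_min = alternating_descent(f, 3.0, 2.0)
```

No gradient of f is ever evaluated; each sweep consists of two scalar
minimizations. For a non-convex objective the result would depend on the
starting point, as in Fig. 4.7.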


4.3 Regression and Ax = b: Over- and Under-Determined 
Systems 


Curve fitting, as shown in the previous two sections, results in an
optimization problem. In many cases, the optimization can be mathematically
framed as solving the linear system of equations Ax = b. Before proceeding to
discuss model selection and the various optimization methods available for this
problem, it is instructive to consider that, in many circumstances in modern
data science, the linear system Ax = b is typically massively over- or
under-determined. Over-determined systems have more constraints (equations)
than unknown variables, while under-determined systems have more unknowns than
constraints. Thus in the former case, there are generally no solutions
satisfying the linear system, and, instead, approximate solutions are found to
minimize a given error. In the latter case, there are an infinite number of
solutions, and some choice of constraint must be made in order to select an
appropriate and unique solution. The goal of this section is to highlight two
different norms ($\ell_2$ and $\ell_1$) used for optimization when solving
Ax = b for over- and under-determined systems. The choice of norm has a
profound impact on the optimal solution achieved.


Figure 4.7: Alternating descent applied to the function in Fig. 4.3(b). Three
initial conditions are shown: $(x_0, y_0) = \{(4, 0), (0, -5), (-5, 2)\}$. The
first of these (red circles) gets stuck in a local minimum, while the other two
initial conditions (blue and magenta) find the global minimum. No gradients are
computed to update the solution. Note the rapid convergence in comparison with
Fig. 4.6.

Before proceeding further, it should be noted that the system Ax = b
considered here is a restricted instance of
$\mathbf{Y} = f(\mathbf{X}, \boldsymbol{\beta})$ in (4.4). Thus the solution
x contains the loadings or leverage scores characterizing the relationship
between the input data A and outcome data b. A simple solution for this linear
problem uses the Moore-Penrose pseudo-inverse $\mathbf{A}^\dagger$ from
Section 1.4:


$$ \mathbf{x} = \mathbf{A}^\dagger \mathbf{b}. \tag{4.38} $$


This operator is computed with the pinv(A) command in MATLAB. However,
such a solution is restrictive, and a greater degree of flexibility is sought
for computing solutions. Our particular aim in this section is to demonstrate
the interplay in solving over- and under-determined systems using the
$\ell_1$- and $\ell_2$-norms.
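
For an over-determined system, the pseudo-inverse solution (4.38) coincides
with the ordinary least-squares solution. The following sketch checks this on
random data; the matrix sizes and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 5))      # over-determined: 50 equations, 5 unknowns
b = rng.random(50)

x_pinv = np.linalg.pinv(A) @ b                    # x = A^dagger b, as in (4.38)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # direct least-squares solve

# At the least-squares solution the residual is orthogonal to the columns of A.
normal_residual = A.T @ (A @ x_pinv - b)
```

Both routes give the same loadings here; lstsq avoids forming the
pseudo-inverse explicitly, which is usually preferable for large systems.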


Over-determined Systems 


Figure 4.8 shows the general structure of an over-determined system. As
already stated, there are generally no solutions that satisfy Ax = b. Thus, the
optimization problem to be solved involves minimizing the error, for example,
the least-squares $\ell_2$ error $E_2$, by finding an appropriate value of
$\hat{\mathbf{x}}$:


$$ \hat{\mathbf{x}} = \operatorname*{argmin}_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2. \tag{4.39} $$


[Figure 4.8 schematic: model terms A, loadings x, outcomes b.]

Figure 4.8: Regression framework for over-determined systems. In this case,
Ax = b cannot be satisfied in general. Thus, finding solutions for this system
involves minimizing, for instance, the least-squares error
$\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$ subject to a constraint on the
solution x, such as minimizing the $\ell_2$-norm $\|\mathbf{x}\|_2$.


This basic architecture does not explicitly enforce any constraints on the
loadings x. In order to both minimize the error and enforce a constraint on the
solution, the basic optimization architecture can be modified to the following:


$$ \hat{\mathbf{x}} = \operatorname*{argmin}_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 + \lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \|\mathbf{x}\|_2, \tag{4.40} $$


where the parameters $\lambda_1$ and $\lambda_2$ control the penalization of
the $\ell_1$- and $\ell_2$-norms, respectively. This now explicitly enforces a
constraint on the solution vector itself, not just on the error. The ability to
design the penalty by adding regularizing constraints is critical for
understanding model selection in the following.

In the examples that follow, a particular focus will be given to the role of
the $\ell_1$-norm. The $\ell_1$-norm, as already shown in Chapter 3, promotes
sparsity so that many of the loadings of the solution x are zero. This will
play an important role in variable and model selection in the next section. For
now, consider solving the optimization problem with $\lambda_2 = 0$. We use the
open-source convex optimization package cvx in MATLAB to compute our solution
to (4.40). The following code considers various values of the $\ell_1$
penalization in producing solutions to an over-determined system with 500
constraints and 100 unknowns.


Code 4.4: [MATLAB] Solutions for an over-determined system.
n=500; m=100;
A=rand(n,m);
b=rand(n,1);
xdag=pinv(A)*b;

lam=[0 0.1 0.5];
for j=1:3
    cvx_begin;
        variable x(m)
        minimize( norm(A*x-b,2) + lam(j)*norm(x,1) );
    cvx_end;
    subplot(4,1,j),bar(x)
    subplot(4,3,9+j),hist(x,20)
end


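Code 4.4: [Python] Solutions for an over-determined system.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

n = 500; m = 100
A = np.random.rand(n,m)
b = np.random.rand(n)

xdag = np.linalg.pinv(A)@b
lam = np.array([0, 0.1, 0.5])

def reg_norm(x,A,b,lam):
    return np.linalg.norm(A@x-b,ord=2) + lam*np.linalg.norm(x,ord=1)

fig,axs = plt.subplots(len(lam),2)
for j in range(len(lam)):
    res = minimize(reg_norm,args=(A,b,lam[j]),x0=xdag)
    x = res.x
    axs[j,0].bar(range(m),x)  # loadings, as in the MATLAB listing
    axs[j,1].hist(x,20)       # histogram of loading values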


Figure 4.9 highlights the results of the optimization process as a function of
the parameter $\lambda_1$. It should be noted that the solution with
$\lambda_1 = 0$ is equivalent to the solution xdag produced by computing the
pseudo-inverse of the matrix A. Note that the $\ell_1$-norm promotes a sparse
solution where many of the components of the solution vector x are zero. The
histograms of the solution values of x in Fig. 4.9(d)-(f) are particularly
revealing, as they show the sparsification process for increasing $\lambda_1$.


Figure 4.9: Solutions to an over-determined system with 500 constraints and
100 unknowns. Panels (a)-(c) show bar plots of the values of the loadings of
the vectors x. Note that as the $\ell_1$ penalty is increased from (a)
$\lambda_1 = 0$ to (b) $\lambda_1 = 0.1$ to (c) $\lambda_1 = 0.5$, the number of
zero elements of the vector increases, i.e., it becomes more sparse. A
histogram of the loading values for (a)-(c) is shown in panels (d)-(f),
respectively. This highlights the role that the $\ell_1$-norm plays in
promoting sparsity in the solution.


The regression for over-determined systems can be generalized to matrix
systems as shown in Fig. 4.8. In this case, the cvx command structure simply
modifies the size of the matrix b and solution matrix x. Figure 4.10 shows the
results of this over-determined matrix system for two different values of the
added $\ell_1$ penalty. Note that the addition of the $\ell_1$-norm sparsifies
the solution and produces a matrix that is dominated by zero entries. The two
examples in Figs. 4.9 and 4.10 show the important role that the $\ell_2$- and
$\ell_1$-norms have in generating different types of solutions. In the
following sections of this book, these norms will be exploited to produce
parsimonious models from data.
Figure 4.10: Solutions to an over-determined system Ax = b with 300
constraints and 60 x 20 unknowns. Panels (a) and (b) show a plot of the values
of the loadings of the matrix x with $\ell_1$ penalty (a) $\lambda_1 = 0$ and
(b) $\lambda_1 = 0.1$.


Under-determined Systems 


For under-determined systems, there are an infinite number of possible
solutions satisfying Ax = b. The goal in this case is to impose an additional
constraint, or set of constraints, whereby a unique solution is generated from
the infinite possibilities. The basic mathematical structure is shown in
Fig. 4.11. As an optimization, the solution to the under-determined system can
be stated as


$$ \min \|\mathbf{x}\|_p \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{b}, \tag{4.41} $$


where the p denotes the p-norm of the vector x. For simplicity, we consider the
$\ell_2$- and $\ell_1$-norms only. As has already been shown for
over-determined systems, the $\ell_1$-norm promotes sparsity of the solution.

We again use the convex optimization package cvx to compute our solution
to (4.41). The following code considers both $\ell_2$ and $\ell_1$ penalization
in producing solutions to an under-determined system with 20 constraints and
100 unknowns.


Code 4.5: [MATLAB] Solutions for an under-determined matrix system.
n=20; m=100
A=rand(n,m); b=rand(n,1);


cvx_begin;
    variable x2(m)
    minimize( norm(x2,2) );
    subject to
        A*x2 == b;
cvx_end;


cvx_begin;
    variable x1(m)
    minimize( norm(x1,1) );
    subject to
        A*x1 == b;
cvx_end;

Code 4.5: [Python] Solutions for an under-determined matrix system.


import numpy as np
from scipy.optimize import minimize

n = 20; m = 100
A = np.random.rand(n,m)
b = np.random.rand(n)


def two_norm(x):
    return np.linalg.norm(x,ord=2)


def one_norm(x):
    return np.linalg.norm(x,ord=1)


constr = ({'type': 'eq', 'fun': lambda x: A @ x - b})

x0 = np.random.rand(m)

res = minimize(two_norm, x0, method='SLSQP', constraints=constr)
x2 = res.x

res = minimize(one_norm, x0, method='SLSQP', constraints=constr)
x1 = res.x


This code produces two solution vectors x2 and x1, which minimize the
$\ell_2$- and $\ell_1$-norm, respectively. Note the way that cvx allows one to
impose constraints in the optimization routine. Figure 4.12 shows a bar plot and
histogram of the two solutions produced. As before, the sparsity-promoting
$\ell_1$-norm yields a solution vector dominated by zeros. In fact, for this
case, there are exactly 80 zeros for this linear system since there are only 20
constraints for the 100 unknowns.
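
As an alternative sketch (a swapped-in technique, not the cvx or SLSQP route
used in the listings above), the $\ell_1$ problem can be rewritten as a linear
program by splitting x = u - v with u, v >= 0, and the $\ell_2$ problem can be
solved with the pseudo-inverse. The random data, the seed, and the use of
scipy.optimize.linprog with the HiGHS solver are assumptions of this example.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 20, 100                       # 20 constraints, 100 unknowns
A = rng.random((n, m))
b = rng.random(n)

# l2: the minimum 2-norm solution of Ax = b is given by the pseudo-inverse.
x2 = np.linalg.pinv(A) @ b

# l1: with x = u - v and u, v >= 0, minimizing sum(u) + sum(v)
# subject to A(u - v) = b minimizes the 1-norm of x.
c = np.ones(2 * m)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
x1 = res.x[:m] - res.x[m:]

n_nonzero = int(np.sum(np.abs(x1) > 1e-8))   # number of active loadings
```

Because a linear program attains its optimum at a vertex of the feasible set,
the $\ell_1$ solution returned here has at most n = 20 nonzero entries, which
matches the count of zeros noted above.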




As with the over-determined system, the optimization can be modified to
handle more general under-determined matrix equations, as shown in Fig. 4.11.


[Figure 4.11 schematic: model terms A, loadings x, outcomes b, with Ax = b.]

Figure 4.11: Regression framework for under-determined systems. In this case,
Ax = b can be satisfied. In fact, there are an infinite number of solutions.
Thus pinning down a unique solution for this system involves minimizing a
constraint. For instance, from an infinite number of solutions, we choose the
one that minimizes the $\ell_2$-norm $\|\mathbf{x}\|_2$, which is subject to
the constraint Ax = b.

The cvx optimization package may be used for this case as before with
over-determined systems. The software engine can also work with more general
p-norms as well as minimize with both ℓ₁ and ℓ₂ penalties simultaneously. For
instance, a common modification of the optimization is the following:


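\[ \min_{\mathbf{x}} \left( \lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \|\mathbf{x}\|_2 \right) \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{b}, \tag{4.42} \]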


where the weighting between λ₁ and λ₂ can be used to promote a desired
sparsification of the solution. These different optimization strategies are
common and will be considered further in the following.


4.4 Optimization as the Cornerstone of Regression 


In the previous two sections of this chapter, the fitting function f(x) was
specified. For instance, it may be desirable to produce a line fit so that
f(x) = β₁x + β₂. The coefficients are then found by the regression and
optimization methods already discussed. In what follows, our objective is to
develop techniques which allow us to objectively select a good model for
fitting the data, i.e., should one use a quadratic or cubic fit? The error
metric alone does not dictate a good
Figure 4.12: Solutions to an under-determined system with 20 constraints and
100 unknowns. Panels (a) and (b) show bar plots of the values of the loadings
of the vectors x. In panel (a), the optimization is subject to minimizing the
ℓ₂-norm of the solution, while panel (b) is subject to minimizing the ℓ₁-norm.
Note that the ℓ₁ penalization produces a sparse solution vector. A histogram
of the loading values for (a) and (b) is shown in panels (c) and (d),
respectively.


model selection, as the more terms that are chosen for fitting, the more param- 
eters are available for lowering the error, regardless of whether the additional 
terms have any meaning or interpretability. 

Optimization strategies will play a foundational role in extracting
interpretable results and meaningful models from data. As already shown in
previous sections, the interplay of the ℓ₂- and ℓ₁-norms has a critical impact
on the optimization outcomes. To illustrate further the role of optimization
and the variety of possible outcomes, consider the simple example of data
generated from noisy measurements of a parabola:


\[ f(x) = x^2 + \mathcal{N}(0, \sigma), \tag{4.43} \]


where N(0, σ) is a normally distributed random variable with mean zero and
standard deviation σ. Figure 4.13(a) shows an example of 100 random
measurements of (4.43). The parabolic structure is clearly evident despite the
noise added to the measurement. Indeed, a parabolic fit is trivial to compute
using the classic least-squares fitting methods outlined in the first section
of this chapter.

Figure 4.13: (a) One hundred realizations of the parabolic function (4.43)
with additive white noise parameterized by σ = 0.1. Although the noise is
small, the least-squares fitting procedure produces significant variability
when fitting to a polynomial of degree 20. Panels (b)-(e) demonstrate the
loadings (coefficients) for the various polynomial coefficients for four
different noise realizations. This demonstrated model variability frames the
model selection architecture.

The goal is to discover the best model for the data given. In practice, we do
not know what the function is, so instead of specifying a model a priori, we
need to discover it. We can begin by positing a regression to a set of
polynomial models. In particular, consider framing the model selection problem
Y = f(X, β) of (4.4) as the following system Ax = b:


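\[
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^{p-1} \\
1 & x_2 & x_2^2 & \cdots & x_2^{p-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{100} & x_{100}^2 & \cdots & x_{100}^{p-1}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}
=
\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{100}) \end{bmatrix},
\tag{4.44}
\]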


where the matrix A contains polynomial models up to degree p - 1, with each
row representing a measurement, the β_k are the coefficients for each
polynomial, and the vector b contains the outcomes (data) f(x_j). In what
follows, we will consider a scenario where 100 measurements are taken and a
20-term (19th-order) polynomial is fit. Thus the matrix system Ax = b results
in an over-determined system as illustrated in Fig. 4.8. Figure 4.13(b)-(e)
shows four typical loadings β computed from the regression procedure. Note
that, despite the low level of noise added, the loadings are significantly
different from one another. Thus each noise realization produces a very
different model to explain the data.
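
The variability described above can be reproduced with a minimal sketch. The
sampling domain x ∈ [0, 4], the seed, and the use of NumPy's SVD-based lstsq
as the least-squares solver are assumptions made for illustration; they are
not taken from the code behind Fig. 4.13.

```python
import numpy as np

rng = np.random.default_rng(1)            # assumed seed
n, p, sigma = 100, 20, 0.1                # measurements, polynomial terms, noise level
x = np.linspace(0, 4, n)                  # assumed sampling domain
A = np.vander(x, p, increasing=True)      # columns 1, x, ..., x^(p-1), as in (4.44)

loadings = np.zeros((4, p))
rel_err = np.zeros(4)
for j in range(4):                        # four independent noise realizations
    f = x**2 + sigma * rng.standard_normal(n)
    beta = np.linalg.lstsq(A, f, rcond=None)[0]
    loadings[j] = beta
    rel_err[j] = np.linalg.norm(f - A @ beta) / np.linalg.norm(f)

spread = loadings.std(axis=0)             # variability of each loading across realizations
print("relative fit errors:", rel_err)
print("largest loading spread:", spread.max())
```

Comparing the rows of loadings shows how strongly the individual coefficients
depend on the noise realization even though the fitting error stays small.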

The variability of the regression results is problematic for model selection.
It suggests that even a small amount of measurement noise can lead to
significantly different conclusions about the underlying model. In what
follows, we quantify this variability while also considering various
regression procedures for solving the over-determined linear system Ax = b.
Highlighted here are five standard methods: least-squares regression (pinv),
the backslash operator (\), the least absolute shrinkage and selection
operator, or LASSO (lasso), robust fit (robustfit), and ridge regression
(ridge). Returning to the last section, and specifically (4.40), helps frame
the mathematical architecture for these various Ax = b solvers. Specifically,
the Moore-Penrose pseudo-inverse (pinv) solves (4.40) with λ₁ = λ₂ = 0. The
backslash command (\) for over-determined systems solves the linear system via
a QR decomposition [711]. The LASSO (lasso) solves (4.40) with λ₁ > 0 and
λ₂ = 0. Ridge regression (ridge) solves (4.40) with λ₁ = 0 and λ₂ > 0.
However, the modern implementation of ridge in MATLAB is a bit more nuanced.
The popular elastic net algorithm weights both the ℓ₂ and ℓ₁ penalty, thus
providing a tunable hybrid model regression between ridge and LASSO. Robust
fit (robustfit) solves (4.40) by a weighted least-squares fitting. Moreover,
it allows one to leverage robust statistics methods and penalize according to
the Huber norm so as to promote outlier rejection [343]. In the data
considered here, no outliers are imposed on the data, so that the power of
robust fit is not properly leveraged. Regardless, it is an important technique
one should consider.
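
As a rough NumPy counterpart to three of these solvers, the sketch below
computes the unpenalized loadings in two ways and a ridge-type solution. It
rests on several assumptions: lstsq is SVD-based rather than QR-based, so it
is only an analogue of the backslash command; the ridge variant uses the
common squared ℓ₂ penalty, solved as an augmented least-squares problem; and
the penalty weight lam, the domain, and the seed are hypothetical choices.
LASSO and robust fit are omitted because NumPy has no built-in equivalent.

```python
import numpy as np

rng = np.random.default_rng(2)                 # assumed seed
n, p = 100, 20
x = np.linspace(0, 4, n)                       # assumed sampling domain
A = np.vander(x, p, increasing=True)           # polynomial library of (4.44)
f = x**2 + 0.1 * rng.standard_normal(n)        # noisy parabola (4.43)

# lambda_1 = lambda_2 = 0: two routes to the least-squares loadings.
beta_pinv = np.linalg.pinv(A) @ f
beta_lstsq = np.linalg.lstsq(A, f, rcond=None)[0]

# lambda_1 = 0, lambda_2 > 0: ridge with a squared l2 penalty, written as the
# augmented least-squares problem [A; sqrt(lam) I] beta = [f; 0].
lam = 0.1                                      # hypothetical penalty weight
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(p)])
f_aug = np.concatenate([f, np.zeros(p)])
beta_ridge = np.linalg.lstsq(A_aug, f_aug, rcond=None)[0]

errors = {}
for name, beta in [("pinv", beta_pinv), ("lstsq", beta_lstsq), ("ridge", beta_ridge)]:
    errors[name] = np.linalg.norm(f - A @ beta) / np.linalg.norm(f)
    print(f"{name:6s} error = {errors[name]:.4f}, ||beta||_2 = {np.linalg.norm(beta):.3f}")
```

The three solvers reach similar fitting errors with different loading vectors,
which is the behavior the comparison in this section is meant to expose.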

Figure 4.14 shows a series of box plots for 100 realizations of data that
illustrate the differences with the various regression techniques considered.
It also highlights critically important differences with optimization
strategies based on the ℓ₂- and ℓ₁-norm. From a model selection point of view,
the least-squares fitting procedure produces significant variability in the
loading parameters β as illustrated in Fig. 4.14(a,b,e). The least-squares
fitting was produced by the Moore-Penrose pseudo-inverse or QR decomposition,
respectively. If some ℓ₁ penalty (regularization) is allowed, then
Fig. 4.14(c,d,f) show that a more parsimonious model is selected with low
variability. This is expected, as the ℓ₁-norm sparsifies the solution vector
of loading values β. Indeed, the standard LASSO regression correctly selects
the quadratic polynomial as the dominant contribution to the data.

Despite the significant variability exhibited in Fig. 4.14 for most of the
loading values by the different regression techniques, the error produced in the
Figure 4.14: Comparison of regression methods for Ax = b for an
over-determined system of linear equations. The 100 realizations of data are
generated from a simple parabola (4.43) that is fit to a 20th-degree
polynomial via (4.44). The box plots show (a) least-squares regression via the
Moore-Penrose pseudo-inverse (pinv), (b) the backslash command (\), (c) LASSO
regression (lasso), (d) LASSO regression with different ℓ₂ versus ℓ₁
penalization, (e) robust fit, and (f) ridge regression. Note the significant
variability in the loading values for the strictly ℓ₂-based methods ((a), (b),
and (e)), and the low variability for ℓ₁-weighted methods and ridge ((c), (d),
and (f)). Only the standard LASSO (c) identifies the dominance of the
parabolic term.


fitting procedure has little variability. Moreover, the various methods all
produce regressions that have comparable error. Thus despite their differences
in optimization frameworks, the error from fitting is relatively agnostic to
the underlying method. This suggests that using the error alone as a metric
for model selection is potentially problematic, since almost any method can
produce a reliable, low-error model. Figure 4.15(a) shows a box plot of the
error produced using the regression methods of Fig. 4.14. All of the
regression techniques produce comparably low-error and low-variability results
using significantly different strategies.
Figure 4.15: (a) Comparison of the error for the six regression methods used
in Fig. 4.14. Despite the variability across the optimization methods, all of
them produce low-error solutions. (b) Error using least-squares regression as
a function of increasing degree of polynomial. The error drops rapidly until
the quadratic term is used in the regression. (c) Detail of the error showing
that the error actually increases slightly by using a higher-degree polynomial
to fit the data.


As a final note to this section and the code provided, we can consider instead
the regression procedure as a function of the number of polynomials in (4.44).
In our example of Fig. 4.14, polynomials up to degree 20 were considered. If,
instead, we sweep through polynomial degrees, then something interesting and
important occurs, as illustrated in Fig. 4.15(b)-(c). Specifically, the error
of the regression collapses to approximately 10⁻³ after the quadratic term is
added, as shown in panel (b). This is expected since the original model was a
quadratic function with a small amount of noise. Remarkably, as more
polynomial terms are added, the ensemble error actually increases in the
regression procedure, as highlighted in panel (c). Thus simply adding more
terms does not improve the error, which is counter-intuitive at first. Note
that we have only swept through polynomials up to degree 10. Note further that
Fig. 4.15(c) is a detail of panel (b). The error produced by a simple
parabolic fit is approximately half that of a polynomial with degree 10. These
results will help frame our model selection framework of the remaining
sections.
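
The degree sweep can be sketched as follows. This is an illustrative
reconstruction, not the code behind Fig. 4.15: it assumes that the error is
measured against the noise-free parabola, that x ∈ [0, 4], and that every
degree is fit to the same noise realization within a trial.

```python
import numpy as np

rng = np.random.default_rng(3)             # assumed seed
n, sigma, trials = 100, 0.1, 100
x = np.linspace(0, 4, n)                   # assumed sampling domain
f_true = x**2
max_degree = 10
libraries = [np.vander(x, d + 1, increasing=True) for d in range(max_degree + 1)]

err = np.zeros(max_degree + 1)
for _ in range(trials):
    f_noisy = f_true + sigma * rng.standard_normal(n)   # one realization per trial
    for degree, A in enumerate(libraries):
        beta = np.linalg.lstsq(A, f_noisy, rcond=None)[0]
        # Error of the fitted curve relative to the noise-free parabola.
        err[degree] += np.linalg.norm(f_true - A @ beta) / np.linalg.norm(f_true)
err /= trials                              # ensemble average

for degree, e in enumerate(err):
    print(f"degree {degree:2d}: error {e:.2e}")
```

Under these assumptions the error drops sharply once the quadratic term is
available and then creeps upward, because each additional term fits a further
component of the noise.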


4.5 The Pareto Front and Lex Parsimoniae 


The preceding chapters have shown that regression is more nuanced than simply
choosing a model and performing a least-squares fit. Not only are there
numerous metrics for constraining the solution, but the model itself should
also be carefully selected in order to achieve a better, more interpretable
description of the data. Such considerations on an appropriate model date back
to William of Occam (c. 1287-1347), who was an English Franciscan friar,
scholastic philosopher, and theologian. Occam proposed his law of parsimony
(in Latin lex parsimoniae), commonly known as Occam’s razor, whereby he stated
that, among competing hypotheses, the one with the fewest assumptions should
be selected, or when you have two competing theories that make exactly the
same predictions, the simpler one is the more likely. The philosophy of
Occam’s razor has been used extensively throughout the physical and biological
sciences for developing governing equations to model observed phenomena.

Figure 4.16: For model selection, the criterion of accuracy (low error) is
balanced against parsimony. There can be a variety of models with the same
number of terms (green and magenta points), but the Pareto frontier (magenta
points) is defined by the envelope of models that produce the lowest error for
a given number of terms. The solid line provides an approximation to the
Pareto frontier. The Pareto optimal solutions (shaded region) are those models
that produce accurate models while remaining parsimonious.

Parsimony also plays a central role in the mathematical work of Vilfredo
Pareto (c. 1848-1923). Pareto was an Italian engineer, sociologist, economist,
political scientist, and philosopher. He made several important contributions
to economics, specifically in the study of income distribution and in the
analysis of individuals’ choices. He was also responsible for popularizing the
use of the term elite in social analysis. In more recent times, he has become
known for the popular 80/20 rule, which is qualitatively illustrated in
Fig. 4.16 and was named after him as the “Pareto principle” by management
consultant Joseph M. Juran in 1941. Stated simply, it is a common principle in
business and consulting management which observes, for instance, that 80% of
sales come from 20% of clients. This concept was popularized by Richard Koch’s
book The 80/20 Principle (along with several follow-up books [398]), which
illustrated a number of practical applications of the Pareto principle in
business management and life.

Pareto and Occam ultimately advocated the same philosophy: explain the
majority of observed data with a parsimonious model. Importantly, model
selection is not simply about reducing error; rather, it is about producing a
model that has a high degree of interpretability, generalization, and
predictive capabilities. Figure 4.16 shows the basic concept of the Pareto
frontier and Pareto optimal solutions. Specifically, for each model
considered, the number of terms and the error in matching the data are
computed. The solutions with the lowest error for a given number of terms
define the Pareto frontier. Those parsimonious solutions that optimally
balance error and complexity are in the shaded region and represent the Pareto
optimal solutions. In game theory, the Pareto optimal solution is thought of
as a strategy that cannot be made to perform better against one opposing
strategy without performing less well against another (in this case, error and
complexity). In economics, it describes a situation in which the profit of one
party cannot be increased without reducing the profit of another. Our
objective is to select, in a principled way, the best model from the space of
Pareto optimal solutions. To this end, information criteria, which will be
discussed in subsequent sections, will be used to select from candidate models
in the Pareto optimal region.
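
The definition of the Pareto frontier used above (the lowest error for each
number of terms) translates directly into a few lines of code. The candidate
list below is entirely hypothetical and only serves to exercise the function;
the helper name pareto_frontier is likewise introduced here for illustration.

```python
def pareto_frontier(models):
    """Return, for each number of terms, the model with the lowest error."""
    best = {}
    for name, n_terms, error in models:
        if n_terms not in best or error < best[n_terms][2]:
            best[n_terms] = (name, n_terms, error)
    return [best[k] for k in sorted(best)]

# Hypothetical candidate models: (label, number of terms, fitting error).
candidates = [
    ("a", 1, 0.90), ("b", 1, 0.75),
    ("c", 2, 0.40), ("d", 2, 0.55),
    ("e", 3, 0.12), ("f", 3, 0.30),
    ("g", 4, 0.11), ("h", 5, 0.10),
]

frontier = pareto_frontier(candidates)
for name, n_terms, error in frontier:
    print(name, n_terms, error)
```

In this made-up example the error barely improves beyond three terms, so the
three-term model would be a natural candidate for the Pareto optimal region.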


Overfitting 


The Pareto concept needs amending when considering application to real data.
Specifically, when building models with many free parameters, which is often
the case in machine learning applications with high-dimensional data, it is
easy to overfit a model to the data. Indeed, the increase in error illustrated
in Fig. 4.15(c) as a function of increasing model complexity illustrates this
point. Thus, unlike what is depicted in Fig. 4.16, where the error goes
towards zero as the number of model terms (parameters) is increased, the error
may actually increase when considering models with a higher number of terms
and/or parameters. To determine the correct model, various cross-validation
and model selection algorithms are necessary.

To illustrate the overfitting that occurs with real data, consider the simple
example of the last section. In this example, we are simply trying to find the
correct parabolic model measured with additive noise (4.43). The results of
Figs. 4.15(b) and 4.15(c) already indicate that overfitting is occurring for
polynomial models beyond second order. The following MATLAB example will
highlight the effects of overfitting. Consider the production of a training
and test set for the parabola of (4.43). The training set is on the region
x ∈ [0, 4] while the test set (extrapolation region) will be for x ∈ [4, 8].

Figure 4.17: (a) The ideal model f(x) = x² over the domain x ∈ [0, 8]. Data is
collected in the region x ∈ [0, 4] in order to build a polynomial regression
model (4.44) with increasing polynomial degree. In the interpolation regime
x ∈ [0, 4], the model error stays constrained, with increasing error due to
overfitting for polynomials of degree greater than two. The error is shown in
panel (b) with a zoom-in of the error in panel (c). For extrapolation,
x ∈ [4, 8], the error grows exponentially beyond a parabolic fit. In panel
(d), the error is shown to grow by many orders of magnitude. A zoom-in of the
region on a logarithmic scale of the error (log(E + 1), where unity is added
so that zero error produces a zero score) shows the exponential growth of
error. This clearly shows that the model trained on the interval x ∈ [0, 4]
does not generalize (extrapolate) to the region x ∈ [4, 8]. This example
should serve as a serious warning and note of caution in model fitting.

This produces the ideal model on two distinct regions: x ∈ [0, 4] and
x ∈ [4, 8]. Once measurement noise is added to the model, then the parameters
for a polynomial fit no longer produce the perfect parabolic model. We can
compute for given noisy measurements both an interpolation error, where
measurements are taken in the data regime of x ∈ [0, 4], and extrapolation
error, where measurements are taken in the data regime of x ∈ [4, 8]. For this
example, a least-squares regression is performed using the pseudo-inverse
(pinv) from MATLAB.
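
A Python sketch of this experiment is given below, using NumPy's pinv in place
of the MATLAB command. The number of samples, the noise level σ = 0.1, the
seed, and the sweep over degrees 1 to 10 are assumptions made here; the errors
are measured relative to the ideal parabola on each region.

```python
import numpy as np

rng = np.random.default_rng(4)                 # assumed seed
n, sigma = 100, 0.1                            # assumed sample count and noise level
x_train = np.linspace(0, 4, n)                 # training (interpolation) region
x_test = np.linspace(4, 8, n)                  # test (extrapolation) region
f_train, f_test = x_train**2, x_test**2        # ideal model on both regions
f_noisy = f_train + sigma * rng.standard_normal(n)

degrees = list(range(1, 11))
E_interp, E_extrap = [], []
for degree in degrees:
    A_train = np.vander(x_train, degree + 1, increasing=True)
    A_test = np.vander(x_test, degree + 1, increasing=True)
    beta = np.linalg.pinv(A_train) @ f_noisy   # least-squares loadings
    E_interp.append(np.linalg.norm(f_train - A_train @ beta) / np.linalg.norm(f_train))
    E_extrap.append(np.linalg.norm(f_test - A_test @ beta) / np.linalg.norm(f_test))

for d, ei, ee in zip(degrees, E_interp, E_extrap):
    print(f"degree {d:2d}: interpolation {ei:.2e}, extrapolation {ee:.2e}")
```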

This simple example shows some of the most basic and common features
associated with overfitting of models. Specifically, overfitting does not
allow for generalization. Consider the results of Fig. 4.17, generated from
the above code. In this example, the least-squares loadings (4.44) for a
polynomial are computed using the pseudo-inverse for data in the range
x ∈ [0, 4]. The interpolation error for these loadings is demonstrated in
Figs. 4.17(b) and (c). Note the impact of overfitting by polynomials for this
interpolation of the data. Specifically, the error of the interpolated fit
increases beyond a second-degree polynomial. Extrapolation for an overfit
model produces significant errors. Figures 4.17(d) and (e) show the error
growth as a function of the least-squares fit pth-degree polynomial model. The
error in Fig. 4.17(d) is on a logarithmic plot since it grows by many orders
of magnitude. This demonstrates a clear inability of the overfit model to
generalize to the range x ∈ [4, 8]. Indeed, only a parsimonious model with a
second-degree polynomial can easily generalize to the range x ∈ [4, 8] while
keeping the error small.

The above example shows that some form of model selection to systematically
deduce a parsimonious model is critical for producing viable models that can
generalize outside of where data is collected. Much of machine learning
revolves around (i) using data to generate predictive models, and (ii)
applying cross-validation techniques to remove the most deleterious effects of
overfitting. Without a cross-validation strategy, one will almost certainly
produce a non-generalizable model such as that exhibited in Fig. 4.17. In what
follows, we will consider some standard strategies for producing reasonable
models.


4.6 Model Selection: Cross-Validation 


The previous section highlights many of the fundamental problems with
regression. Specifically, it is easy to overfit a model to the data, thus
leading to a model that is incapable of generalizing for extrapolation. This
is an especially pernicious issue in training deep neural nets. To overcome
the consequences of overfitting, various techniques have been proposed to more
appropriately select a parsimonious model with only a few parameters, thus
balancing the error with a model that can more easily generalize or
extrapolate. This provides a reinterpretation of the Pareto front in
Fig. 4.16. Specifically, the error increases dramatically with the number of
terms due to overfitting, especially when used for extrapolation.

There are two common mathematical strategies for circumventing the effects of
overfitting in model selection: cross-validation and computing information
criteria. This section considers the former, while the latter method is
considered in the next section.

Figure 4.18: Cross-validation using k-fold strategy with k = 2, 20, and 100
(left, middle, and right columns, respectively). Three different regression
strategies are cross-validated: least-squares fitting of pseudo-inverse, the
QR-based backslash, and the sparsity-promoting LASSO. Note that the LASSO for
this example produces the quadratic model within even a one- or two-fold
validation. The backslash-based QR algorithm has a strong signature after
100-fold cross-validation, while the least-squares fitting suggests that the
quadratic and cubic terms are both important even after 100-fold
cross-validation.

Cross-validation strategies are perhaps the most common and critical tech- 
niques in almost all machine learning algorithms. Indeed, one should never 
trust a model unless properly cross-validated. Cross-validation can be stated 
quite simply: Take random portions of your data and build a model. Do this 
k times and average the parameter scores (regression loadings) to produce the 
cross-validated model. Test the model predictions against withheld (extrapo- 
lation) data and evaluate whether the model is actually any good. This com- 
monly used strategy is called k-fold cross-validation. It is simple, intuitively 
appealing, and the k-fold model-building procedure produces a statistically
based model for evaluation. 

To illustrate the concept of cross-validation, we will once again consider
fitting polynomial models to the simple function f(x) = x² (see Fig. 4.18).
The previous sections of this chapter have already considered this problem in
detail, looking at both the various regression frameworks available
(pseudo-inverse, LASSO, robust fit, etc.), as well as their ability to
accurately produce a model for interpolating and extrapolating data. The
following MATLAB code considers three regression techniques (least-squares
fitting of pseudo-inverse, the QR-based backslash, and the sparsity-promoting
LASSO) for k-fold cross-validation (k = 2, 20, and 100). In this case, one can
think of the k snapshots of data as trial measurements. As one might expect,
there would be an advantage as more trials are taken, and k = 100 models are
averaged for a final model.

Figure 4.18 shows the results of the k-fold cross-validation computations.
By promoting sparsity (parsimony), the LASSO achieves the desired quadratic
model after even a single k = 1 fold (i.e., without any cross-validation).
In contrast, the least-squares regression (pseudo-inverse) and QR-based
regression both require a significant number of folds to produce the dominant
quadratic term. The least-squares regression, even after k = 100 folds, still
includes both a quadratic and a cubic term.

The final model selection process under k-fold cross-validation often
involves a thresholding of terms that are small in the regression. We
demonstrate the regression using three modeling strategies. Although the LASSO
looks almost ideal, it still has a small contributing linear component. The QR
strategy of backslash produces a number of small components scattered among
the polynomials used in the fit. The least-squares regression has the dominant
quadratic and cubic terms with a large number of non-zero coefficients
scattered across the polynomials. If one thresholds the loadings, then the
LASSO and backslash will produce exactly the quadratic model, while the
least-squares fit produces a quadratic-cubic model. The loading coefficients
are thresholded to produce the final cross-validated model. This model can
then be evaluated against both the interpolated and extrapolated data regions
as in Fig. 4.19.

The results of Fig. 4.19 show that the model selection process, and the
regression technique used, make a critical difference in producing a viable
model. They further show that, despite a k-fold cross-validation, the
extrapolation error, or generalizability, of the model can still be poor. A
good model is one that keeps errors small and also generalizes well, as does
the LASSO in the previous example.
Figure 4.19: Error and loading results for k = 100-fold cross-validation. The
loadings for the k-fold validation (panel (b) with thresholding denoted by
subscript +, and panel (a) without thresholding) are shown for least-squares
fitting of pseudo-inverse, the QR-based backslash, and the sparsity-promoting
LASSO (see Fig. 4.18). Both the interpolation error (panel (c) and detail in
(e)) and extrapolation error (panel (d) and detail in (f)) are computed. The
LASSO performs well for both interpolation and extrapolation, while a
least-squares fit gives poor performance under extrapolation. The six models
considered are: 1, pseudo-inverse; 2, backslash; 3, LASSO; 4, thresholded
pseudo-inverse; 5, thresholded backslash; and 6, thresholded LASSO.




k-Fold Cross-Validation


The process of k-fold cross-validation is highlighted in Fig. 4.20. The
concept is to partition a data set into a training set and a test set. The
test set, or withhold set, is kept separate from any training procedure for
the model. Importantly, the test set is where the model produces an
extrapolation approximation, which
Figure 4.20: Procedure for k-fold cross-validation of models. The data is
initially partitioned into a training set and test (withhold) set. Typically,
the withhold set is generated from a random sample of the overall data. The
training data is partitioned into k-folds whereby a random sub-selection of
the training data is collected in order to build a regression model
Y_j = f(X_j, β_j). Importantly, each model generates the loading parameters
β_j. After the k-fold models are generated, the best model Y = f(X, β̄) is
produced. There are different ways to get the best model; in some cases, it
may be appropriate to average the model parameters so that
β̄ = (1/k) Σ_{j=1}^{k} β_j. One could also simply pick the best parameters from
the k-fold set. In either case, the best model is then tested on the withheld
data to evaluate its viability.


the figures of the last two sections show to be challenging. In k-fold
cross-validation, the training data is further partitioned into k-folds, which
are typically randomly selected portions of the data. For instance, in
standard 10-fold cross-validation, the training data is randomly partitioned
into 10 partitions (or folds). Each partition is used to construct a
regression model Y_j = f(X_j, β_j) for j = 1, 2, ..., 10. One method for
constructing the final model is to average the loading values
β̄ = (1/k) Σ_{j=1}^{k} β_j, which are then used for the final, cross-validated
regression model Y = f(X, β̄). This model is then used on the withhold data to
test its extrapolation power, or generalizability. The error on this withhold
test set is what determines the efficacy of the model. There are a variety of
other methods for selecting the best model, including simply choosing the best
of the k-fold models. As for partitioning the data, a common strategy is to
break the data into 70% training data, 20% validation data, and 10% withheld
data. For very large data sets, the validation and withheld data sets can be
reduced provided there is enough data to accurately assess the model
constructed.
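
The procedure just described can be sketched in a few lines. The example is a
minimal illustration under several assumptions: a four-term polynomial library,
200 noisy samples of the parabola (4.43), a random 10% withheld set, and
k = 10 folds, each of which is fit by least squares before the loadings are
averaged. None of these sizes comes from the text, and the helper name design
is introduced here.

```python
import numpy as np

rng = np.random.default_rng(5)                     # assumed seed
n, sigma, k, p = 200, 0.1, 10, 4                   # hypothetical sizes
x = rng.uniform(0, 4, n)
y = x**2 + sigma * rng.standard_normal(n)          # noisy parabola (4.43)

def design(x, p):
    """Polynomial library with columns 1, x, ..., x^(p-1)."""
    return np.vander(x, p, increasing=True)

# Withhold a random 10% of the data as the test set.
perm = rng.permutation(n)
n_test = n // 10
test_idx, train_idx = perm[:n_test], perm[n_test:]

# Partition the training data into k folds and fit one model per fold.
folds = np.array_split(train_idx, k)
betas = np.array([
    np.linalg.lstsq(design(x[idx], p), y[idx], rcond=None)[0] for idx in folds
])
beta_bar = betas.mean(axis=0)                      # averaged loadings

# Evaluate the cross-validated model on the withheld data.
y_pred = design(x[test_idx], p) @ beta_bar
test_err = np.linalg.norm(y[test_idx] - y_pred) / np.linalg.norm(y[test_idx])
print("averaged loadings:", np.round(beta_bar, 3))
print("withheld-set error:", test_err)
```

Picking the single fold with the lowest withheld-set error instead of
averaging would only require replacing the line that computes beta_bar.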


Leave-p-Out Cross-Validation


Another standard technique for cross-validation involves the so-called
leave-p-out cross-validation (LpO CV). In this case, p samples of the training
data are removed from the data and kept as the validation set. A model is
built on the remaining training data, and the accuracy of the model is tested
on the p withheld samples. This is repeated with a new selection of p samples
until all the training data has been part of the validation data set. The
accuracy of the model is then evaluated by averaging, over the various
partitions of the data, the accuracy of the models on their withheld samples
and the loadings they produce.
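
A minimal sketch of this loop is shown below. It follows the description
above, cycling through disjoint groups of p samples until every sample has
been withheld once; an exhaustive variant would instead enumerate every
possible subset of p samples. The data size, the group size p_out, the
three-term (quadratic) library, and the seed are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(6)                 # assumed seed
n, sigma, p_out, n_terms = 60, 0.1, 5, 3       # hypothetical sizes
x = rng.uniform(0, 4, n)
y = x**2 + sigma * rng.standard_normal(n)

order = rng.permutation(n)
val_errors, betas = [], []
for start in range(0, n, p_out):
    val_idx = order[start:start + p_out]               # the p samples left out
    fit_idx = np.setdiff1d(order, val_idx)             # remaining training data
    A_fit = np.vander(x[fit_idx], n_terms, increasing=True)
    beta = np.linalg.lstsq(A_fit, y[fit_idx], rcond=None)[0]
    A_val = np.vander(x[val_idx], n_terms, increasing=True)
    val_errors.append(np.sqrt(np.mean((y[val_idx] - A_val @ beta) ** 2)))
    betas.append(beta)

mean_error = np.mean(val_errors)               # averaged validation accuracy
beta_bar = np.mean(betas, axis=0)              # averaged loadings
print("mean validation RMSE:", mean_error)
print("averaged loadings:", np.round(beta_bar, 3))
```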


4.7 Model Selection: Information Criteria 


There is a different approach to model selection than the cross-validation
strategies outlined in the previous section. Indeed, model selection has a
rigorous set of mathematical innovations starting from the early 1950s. The
Kullback-Leibler (KL) divergence measures the distance between two probability
density distributions (or data sets which represent the truth and a model) and
is the core of modern information theory criteria for evaluating the viability
of a model. The KL divergence has deep mathematical connections to statistical
methods characterizing entropy as developed by Ludwig E. Boltzmann
(c. 1844-1906), as well as a relation to information theory developed by
Claude Shannon [653]. Model selection is a well-developed field with a large
body of literature, most of which is exceptionally well reviewed by Burnham
and Anderson [142]. In what follows, only brief highlights will be given to
demonstrate some of the standard methods.
The KL divergence between two models f(X, β) and g(X, μ) is defined as


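\[ I(f, g) = \int f(\mathbf{X}, \boldsymbol{\beta}) \log\left[ \frac{f(\mathbf{X}, \boldsymbol{\beta})}{g(\mathbf{X}, \boldsymbol{\mu})} \right] d\mathbf{X}, \tag{4.45} \]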


where β and μ are parameterizations of the models f(·) and g(·), respectively.
From an information theory perspective, the quantity I(f, g) measures the
information lost when g is used to represent f. Note that if f = g, then the
log term is zero (i.e., log(1) = 0) and I(f, g) = 0, so that there is no
information lost. In practice, f will represent the truth, or measurements of
an experiment, while g will be a model proposed to describe f.
Figure 4.21: Comparison of three models g₁(x), g₂(x), and g₃(x) against the
truth model f(x). The KL divergence I_j(f, g_j) for each model is computed,
showing that the model g₁(x) is closest to statistically representing the true
data.


Unlike the regression and cross-validation performed previously, when
computing KL divergence, a model must be specified. Recall that we used
cross-validation previously to generate a model using different regression
strategies (see Fig. 4.20 for instance). Here a number of models will be
posited and the loss of information, or KL divergence, of each model will be
computed. The model with the lowest loss of information is generally regarded
as the best model. Thus given M proposed models g_j(X, μ_j), where
j = 1, 2, ..., M, we can compute I_j(f, g_j) for each model. The correct
model, or best model, is the one that minimizes the information loss
min_j I_j(f, g_j).

As a simple example, consider Fig. 4.21, which shows three different models
that are compared to the truth data. The computation of the KL divergence
score is also illustrated. Note that, in order to avoid division by zero, a
constant offset is added to each probability distribution. The truth data
generated, f(x), is a simple normally distributed variable. The three models
shown are variants of normally and uniformly distributed functions.
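
A discrete version of (4.45) can be sketched as follows. The candidate
densities, the grid, and the offset value are hypothetical and are not those
of Fig. 4.21. One further assumption is made here: after the offset is added,
each density is renormalized to integrate to one, which keeps the score
non-negative.

```python
import numpy as np

x = np.linspace(-6, 6, 1201)
dx = x[1] - x[0]

def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def uniform_pdf(x, a, b):
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

def kl_divergence(f, g, dx, offset=1e-2):
    """Riemann-sum estimate of I(f, g) = int f log(f / g) dx."""
    f = f + offset              # constant offset avoids division by zero
    g = g + offset
    f = f / (f.sum() * dx)      # renormalize so each density integrates to one
    g = g / (g.sum() * dx)
    return np.sum(f * np.log(f / g)) * dx

f = normal_pdf(x, 0.0, 1.0)                      # "truth": a normal distribution
models = {                                       # hypothetical candidate models
    "g1": normal_pdf(x, 0.2, 1.2),
    "g2": normal_pdf(x, 2.0, 1.0),
    "g3": uniform_pdf(x, -3.0, 3.0),
}
scores = {name: kl_divergence(f, g, dx) for name, g in models.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

The model with the smallest score loses the least information about f; here
that is the candidate whose mean and width are closest to the truth.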
Information Criteria: AIC and BIC


This simple example shows the basic ideas behind model selection: compute a 
distance between a proposed model output $g_j(x)$ and the measured truth f(x).
In the early 1970s, Hirotugu Akaike combined Fisher’s maximum-likelihood 
computation with the KL divergence score to produce what is now called 
the Akaike information criterion (AIC) [9]. This was later modified by Gideon 
Schwarz to the so-called Bayesian information criterion (BIC) [646], which pro- 
vided an information score that was guaranteed to converge to the correct 
model in the large-data limit, provided the correct model was included in the 
set of candidate models. 

To be more precise, we turn to Akaike’s seminal contribution [9]. Akaike
was aware that KL divergence cannot be computed in practice since it requires
full knowledge of the statistics of the truth model f(x) and of all the parameters
in the proposed models $g_j(x)$. Thus, Akaike proposed an alternative way
to estimate KL divergence based on the empirical log-likelihood function at its
maximum point. This is computable in practice and was a critically enabling
insight for rigorous methods of model selection. The technical aspects of Akaike’s
work connecting log-likelihood estimates and KL divergence [9] were a
paradigm-shifting mathematical achievement, and led to the development
of the AIC score

$$\mathrm{AIC} = 2K - 2\log\left[L(\hat{\mu}\,|\,x)\right], \tag{4.46}$$


where K is the number of parameters used in the model, $\hat{\mu}$ is an estimate of the
best parameters used (i.e., lowest KL divergence) in $g(X, \mu)$ computed from a
maximum-likelihood estimate (MLE), and x are independent samples of the data
to be fit. Thus, instead of a direct measure of the distance between two models,
the AIC provides an estimate of the relative distance between the approximating
model and the true model or data. As the number of terms in a proposed model
grows, the AIC score increases linearly through the penalty term 2K, thus providing a
penalty for non-parsimonious models. Importantly, due to its relative measure,
it will always result in an objective “best” model with the lowest AIC score, but 
this best model may still be quite poor in prediction and reconstruction of the 
data. 

AIC is one of the standard model selection criteria used today. However, 
there are others. Highlighted here is the modification of AIC by Schwarz to 
construct BIC [646]. BIC is almost identical to AIC aside from the penalization 
of the information criteria by the number of terms. Specifically, BIC is defined 
as 

$$\mathrm{BIC} = \log(n)\,K - 2\log\left[L(\hat{\mu}\,|\,x)\right], \tag{4.47}$$


where n is the number of data points, or sample size, considered. This slightly 
different version of the information criterion has one significant consequence. 
The seminal contribution of Schwarz was to prove that, if the correct model was
included along with a set of candidate models, then it would be theoretically
guaranteed to be selected as the best model based upon BIC for a sufficiently
large set of data x. This is in contrast to AIC, which, in certain pathological
cases, can select the wrong model.
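
Equations (4.46) and (4.47) are simple to evaluate once a maximized log-likelihood is available. The sketch below assumes Gaussian residuals for polynomial least-squares fits, an assumption made here purely for illustration; the synthetic data, the noise level, and the helper names `aic`, `bic`, and `gaussian_loglik` are hypothetical and do not come from the text.

```python
import numpy as np

def aic(logL, K):
    """Akaike information criterion: 2K - 2 log L."""
    return 2 * K - 2 * logL

def bic(logL, K, n):
    """Bayesian information criterion: log(n) K - 2 log L."""
    return np.log(n) * K - 2 * logL

def gaussian_loglik(residuals):
    """Maximized log-likelihood for i.i.d. Gaussian residuals,
    with the noise variance set to its maximum-likelihood estimate."""
    n = residuals.size
    sigma2 = np.mean(residuals ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(0)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.5 * rng.standard_normal(n)  # quadratic truth

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    logL = gaussian_loglik(y - np.polyval(coeffs, x))
    K = degree + 2   # polynomial coefficients plus the noise variance
    print(degree, round(aic(logL, K), 2), round(bic(logL, K, n), 2))
```

The two criteria differ only in the penalty: BIC charges log(n) per parameter instead of 2, so it penalizes extra terms more heavily whenever n exceeds e^2 (about 7.4 samples).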


Computing AIC and BIC Scores 


MATLAB allows us to directly compute the AIC and/or BIC score from the 
aicbic command. This computational tool is embedded in the econometrics 
toolbox, and it allows one to evaluate a set of models against one another. The 
evaluation is made from the log-likelihood estimate of the models under con- 
sideration. An arbitrary number of models can be compared. 

In the specific example considered here, we consider a ground-truth model 
constructed from the autoregressive model 


$$x_n = -4 + 0.2\,x_{n-1} + 0.5\,x_{n-2} + \mathcal{N}(0, 2), \tag{4.48}$$


where $x_n$ is the value of the time series at time $t_n$, and $\mathcal{N}(0, 2)$ is a white-noise
process with mean zero and variance two. We fit three autoregressive inte- 
grated moving average (ARIMA) models to the data. The three ARIMA models 
have one, two, and three time delays in their models. The following code com- 
putes their log-likelihood and corresponding AIC and BIC scores. 


Code 4.6: [MATLAB] Computation of AIC and BIC scores. 


T = 100; % Sample size
DGP = arima('Constant',-4,'AR',[0.2, 0.5],'Variance',2);
y = simulate(DGP,T);

EstMdl1 = arima('ARLags',1);
EstMdl2 = arima('ARLags',1:2);
EstMdl3 = arima('ARLags',1:3);

logL = zeros(3,1); % Preallocate loglikelihood vector
[~,~,logL(1)] = estimate(EstMdl1,y); %,'print',false);
[~,~,logL(2)] = estimate(EstMdl2,y); %,'print',false);
[~,~,logL(3)] = estimate(EstMdl3,y); %,'print',false);

[aic,bic] = aicbic(logL, [3; 4; 5], T*ones(3,1))


Code 4.6: [Python] Computation of AIC and BIC scores. 


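import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

T = 100  # sample size
# ArmaProcess expects lag-polynomial coefficients, so the AR part of (4.48)
# is entered as [1, -0.2, -0.5]; the constant -4 is applied through the mean.
arparams = np.array([1, -0.2, -0.5])
maparams = np.array([1])
arma_process = ArmaProcess(arparams, maparams)
mean = -4 / (1 - 0.2 - 0.5)
y = mean + arma_process.generate_sample(T, scale=np.sqrt(2))  # variance 2

logL = np.zeros(3)  # log-likelihood vector
aic = np.zeros(3)   # AIC vector
bic = np.zeros(3)   # BIC vector

for j in range(3):  # models with one, two, and three time delays
    # ARIMA replaces the legacy sm.tsa.arima_model.ARMA class, which has
    # been removed from recent statsmodels releases.
    model_res = ARIMA(y, order=(j+1, 0, 0), trend='c').fit()
    logL[j] = model_res.llf
    aic[j] = model_res.aic
    bic[j] = model_res.bic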


Note that the best model, the one with both the lowest AIC and BIC scores, 
is the second model, which has two time delays. This is expected, as it cor- 
responds to the ground-truth model. The output in this case is given by the 
following. 

aic =
  381.7732
  358.2422
  358.8479

bic =
  389.5887
  368.6629
  371.8737


The lowest AIC and BIC scores are 358.2422 and 368.6629, respectively. Note that,
although the correct model was selected, the AIC score provides little distinc- 
tion between models, especially the two and three time delay models. 
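
As a consistency check on the reported scores, note that (4.46) and (4.47) share the same log-likelihood term, so their difference depends only on K and n. Taking the two-delay model to have K = 4 parameters (two lag coefficients, a constant, and the noise variance; this count is an assumption, since the text does not state it explicitly) and n = 100 samples:

```latex
\mathrm{BIC} - \mathrm{AIC} = K\left(\log n - 2\right)
  = 4\,(\log 100 - 2) \approx 4 \times 2.6052 \approx 10.4207,
\qquad
368.6629 - 358.2422 = 10.4207 .
```

The two numbers agree, which also makes explicit how much extra penalty BIC assigns per parameter at this sample size.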

Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved.




Homework 


Exercise 4-1. Derive in closed form the 3 × 3 matrix which results from a
least-squares regression to a parabolic fit $f(x) = Ax^2 + Bx + C$.


Exercise 4-2. Consider the following temperature data taken over a 24-hour 
(military time) cycle: 
75 at 01   77 at 02   76 at 03   73 at 04   69 at 05   68 at 06   63 at 07   59 at 08
57 at 09   55 at 10   54 at 11   52 at 12   50 at 13   50 at 14   49 at 15   49 at 16
49 at 17   50 at 18   54 at 19   56 at 20   59 at 21   63 at 22   67 at 23   72 at 24


Fit the data with the parabolic fit 
$$f(x) = Ax^2 + Bx + C \tag{4.49}$$


and calculate the $E_2$ error. Use both a linear interpolation and spline to generate
an interpolated approximation to the data for x = 1: 0.01 : 24. 


Develop a least-squares algorithm and calculate $E_2$ for
$$y = A\cos(Bx) + C. \tag{4.50}$$


Evaluate the resulting fit as a function of the initial guess for the values of A, B, 
and C. 


Exercise 4-3. For the temperature data of the previous example, consider a
polynomial fit of the form

$$f(x) = \sum_{k=0}^{10} \alpha_k x^k, \tag{4.51}$$

where the loadings $\alpha_k$ are to be determined by four regression techniques:
least-squares, LASSO, ridge, and elastic net. Compare the models for each against
each other.


Randomly pick any time point and corrupt the temperature measurement at 
that location. For instance, the temperature reading at that location could be 
zero. Investigate the resulting model and $E_2$ error for the four regression
techniques considered. Identify the models that are robust to such an outlier and
those that are not. Explicitly calculate the variance of the loading coefficients
$\alpha_k$ for each method for a number of random trials with one or more corrupt
data points. 


Exercise 4-4. Download the MNIST data set (both training and test sets and
labels) from http://yann.lecun.com/exdb/mnist/.

The labels will tell you which digit it is: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0. Let each output
be denoted by the vector $\mathbf{y}_j$:

$$\text{“1”} = [1, 0, 0, \ldots, 0, 0]^T, \quad \text{“2”} = [0, 1, 0, \ldots, 0, 0]^T, \quad \ldots, \quad \text{“9”} = [0, 0, 0, \ldots, 1, 0]^T, \quad \text{“0”} = [0, 0, 0, \ldots, 0, 1]^T. \tag{4.52}$$


Now let B be the set of output vectors 


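$$\mathbf{B} = \begin{bmatrix} \mathbf{y}_1 & \mathbf{y}_2 & \mathbf{y}_3 & \cdots & \mathbf{y}_n \end{bmatrix} \tag{4.53}$$


and let the matrix A be the corresponding reshaped (vectorized) MNIST images


$$\mathbf{A} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots & \mathbf{x}_n \end{bmatrix}. \tag{4.54}$$


Thus each vector $\mathbf{x}_j \in \mathbb{R}^{n^2}$ is a vector reshaped from the n × n image.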


Using various AX = B solvers, determine a mapping from the image space to 
the label space. 


By promoting sparsity, determine and rank which pixels in the MNIST set are 
most informative for correctly labeling the digits. (You will have to come up 
with your own heuristics or empirical rules for this. Be sure to visualize the re- 
sults from X.) Apply your most important pixels to the test data set to see how 
accurate you are with as few pixels as possible. Redo the analysis with each 
digit individually to find the most important pixels for each digit. Think about 
the interpretation of what you are doing with this AX = B problem. 


Clustering and Classification


Machine learning is based upon optimization techniques for data. The goal is 
to find both a low-rank subspace for optimally embedding the data, as well 
as regression methods for clustering and classification of different data types. 
Machine learning thus provides a principled set of mathematical methods for 
extracting meaningful features from data, i.e., data mining, as well as binning 
the data into distinct and meaningful patterns that can be exploited for decision 
making. Specifically, it learns from and makes predictions based on data. For 
business applications, this is often called predictive analytics, and it is at the fore- 
front of modern data-driven decision making. In an integrated system, such as 
is found in autonomous robotics, various machine learning components (e.g., 
for processing visual and tactile stimuli) can be integrated to form what we now 
call artificial intelligence (AI). To be explicit: AI is built upon integrated machine 
learning algorithms, which in turn are fundamentally rooted in optimization. 
There are two broad categories for machine learning: supervised machine 
learning and unsupervised machine learning. In the former, the algorithm is pre- 
sented with labeled data sets. The training data, as outlined in the cross-validation 
method of the last chapter, is labeled by a teacher/expert. Thus examples of the 
input and output of a desired model are explicitly given, and regression meth- 
ods are used to find the best model for the given labeled data, via optimization. 
This model is then used for prediction and classification using new data. There 
are important variants of supervised methods, including semi-supervised learn- 
ing in which incomplete training is given so that some of the input/output 
relationships are missing, i.e., for some input data, the actual output is missing. 
Active learning is another common subclass of supervised methods whereby the 
algorithm can only obtain training labels for a limited set of instances, based 
on a budget, and also has to optimize its choice of objects for which to ac- 
quire labels. In an interactive framework, these can be presented to the user 
for labeling. Finally, in reinforcement learning, rewards or punishments are the 
training labels that help shape the regression architecture in order to build the 
best model. In contrast, no labels are given for unsupervised learning algorithms. 
Thus, they must find patterns in the data in a principled way in order to
determine how to cluster data and generate labels for predicting and classifying
new data. In unsupervised learning, the goal itself may be to discover patterns 
in the data embedded in the low-rank subspaces so that feature engineering or 
feature extraction can be used to build an appropriate model. 

In this chapter, we will consider some of the most commonly used super- 
vised and unsupervised machine learning methods. As will be seen, our goal is 
to highlight how data mining can produce important data features (feature en- 
gineering) for later use in model building. We will also show that the machine 
learning methods can be broadly used for clustering and classification, as well 
as for building regression models for prediction. Critical to all of this machine 
learning architecture is finding low-rank feature spaces that are informative 
and interpretable. 


5.1 Feature Selection and Data Mining 


To exploit data for diagnostics, prediction, and control, dominant features of the 
data must be extracted. In the opening chapter of this book, singular value de- 
composition (SVD) and principal component analysis (PCA) were introduced 
as methods for determining the dominant correlated structures contained within 
a data set. In the eigenfaces example of Section 1.6, for instance, the dominant
features of a large number of cropped face images were shown. These eigenfaces,
which are ordered by their ability to account for commonality (correlation)
across the database of faces, were guaranteed to give the best set of r
features for reconstructing a given face in an $\ell_2$ sense with a rank-r truncation.
The eigenface modes gave clear and interpretable features for identifying
faces, including highlighting the eyes, nose, and mouth regions, as might be 
expected. Importantly, instead of working with the high-dimensional measure- 
ment space, the feature space allows one to consider a significantly reduced 
subspace where diagnostics can be performed. 

The goal of data mining and machine learning is to construct and exploit the 
intrinsic low-rank feature space of a given data set. The feature space can be found 
in an unsupervised fashion by an algorithm, or it can be explicitly constructed 
by expert knowledge and/or correlations among the data. For eigenfaces, the 
features are the PCA modes generated by the SVD. Thus each PCA mode is 
high-dimensional, but the only quantity of importance in feature space is the 
weight of that particular mode in representing a given face. If one performs 
an r-rank truncation, then any face needs only r features to represent it in fea- 
ture space. This ultimately gives a low-rank embedding of the data in an in- 
terpretable set of r features that can be leveraged for diagnostics, prediction, 
reconstruction, and/or control. 
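
To make the notion of feature-space coordinates concrete, the following sketch builds a synthetic data matrix, computes its SVD, and represents one column by just r weights. The random matrix stands in for a set of vectorized images and is an assumption made for illustration only; the sizes and the rank are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_images, r = 400, 60, 5

# Synthetic stand-in for a matrix of vectorized images (one image per column):
# an exactly rank-5 signal plus a small amount of noise.
X = rng.standard_normal((n_pixels, 5)) @ rng.standard_normal((5, n_images))
X += 0.01 * rng.standard_normal((n_pixels, n_images))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Ur = U[:, :r]                      # first r modes (the features)

x = X[:, 0]                        # one high-dimensional measurement
a = Ur.T @ x                       # its r coordinates in feature space
x_hat = Ur @ a                     # rank-r reconstruction from r numbers

rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(a.shape, rel_err)
```

Each mode in `Ur` has 400 entries, but the column `x` is summarized by the 5 numbers in `a`. These weights coincide with the scaled entries of the first column of `Vt`, that is, `s[:r] * Vt[:r, 0]`, which is why loadings can be read directly from the V matrix.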

Figure 5.1: Fisher iris data set with 150 measurements over three varieties,
including 50 measurements each of Iris setosa, I. versicolor, and I. virginica. Each
flower includes a measurement of sepal length, sepal width, petal length, and 
petal width. The first three of these are illustrated here, showing that these sim- 
ple biological features are sufficient to show that the data has distinct, quantifi- 
able differences between the species. 


Several examples will be developed that illustrate how to generate a feature 
space, starting with a standard data set included with MATLAB. The Fisher 
iris data set includes measurements of 150 irises of three varieties: Iris setosa, I. 
versicolor, and I. virginica. The 50 samples of each flower include measurements 
in centimeters of the sepal length, sepal width, petal length, and petal width. 
For this data set, the four features are already defined in terms of interpretable 
properties of the biology of the plants. For visualization purposes, Fig. 5.1
considers only the first three of these features. The following code accesses the
Fisher iris data set: 


Code 5.1: [MATLAB] Features of the Fisher irises. 


load fisheriris;

x1=meas(1:50,:);    % setosa
x2=meas(51:100,:);  % versicolor
x3=meas(101:150,:); % virginica


Code 5.1: [Python] Features of the Fisher irises. 


import os
from scipy import io

fisheriris_mat = io.loadmat(os.path.join('..','DATA','fisheriris.mat'))
meas = fisheriris_mat['meas']

x1 = meas[:50,:]     # setosa
x2 = meas[50:100,:]  # versicolor
x3 = meas[100:,:]    # virginica


Figure 5.2: Example images of dogs (left) and cats (right). Our goal is to
construct a feature space where automated classification of these images can be
efficiently computed.


Figure 5.1 shows that the properties measured can be used as a good set
of features for clustering and classification purposes. Specifically, the three iris 
varieties are well separated in this feature space. The setosa is most distinc- 
tive in its feature profile, while the versicolor and virginica have a small over- 
lap among the samples taken. For this data set, machine learning is certainly 
not required to generate a good classification scheme. However, data generally 
does not so readily reduce down to simple two- and three-dimensional visual 
cues. Rather, decisions about clustering in feature space occur with many more 
variables, thus requiring the aid of computational methods to provide good 
classification schemes. 

As a second example, we consider in Fig. 5.2 a selection from an image
database of 80 dogs and 80 cats. A specific goal for this data set is to develop 
an automated classification method whereby the computer can distinguish be- 
tween cats and dogs. In this case, the data for each cat and dog is the 64 x 64 
pixel space of the image. Thus each image has 4096 measurements, in contrast 
to the four measurements for each example in the iris data set. Like eigenfaces, 
we will use the SVD to extract the dominant correlations among the images. 
The following code loads the data and performs a singular value decomposi- 
tion on the data after the mean is subtracted. The SVD produces an ordered 
set of modes characterizing the correlation between all the dog and cat images. 
Figure 5.3 shows the first four SVD modes of the 160 images (80 dogs and 80
cats). 


Code 5.2: [MATLAB] Features of dogs and cats. 


load dogData.mat
load catData.mat
CD=double([dog cat]);
[u,s,v]=svd(CD-mean(CD(:)),'econ');


Figure 5.3: First four features (a)-(d) generated from the SVD of the 160 images
of dogs and cats, i.e., these are the first four columns of the U matrix of the
SVD. Typical cat and dog images are shown in Fig. 5.2. Note that the first two
modes (a) and (b) show that the triangular ears are important features when
images are correlated. This is certainly a distinguishing feature for cats, while
dogs tend to lack this feature. Thus, in feature space, cats generally add these
two dominant modes to promote this feature, while dogs tend to subtract these
features to remove the triangular ears from their representation.


Code 5.2: [Python] Features of dogs and cats. 


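# dogdata_mat and catdata_mat are assumed to have been loaded beforehand
# with scipy.io.loadmat, as in Code 5.1.
dog = dogdata_mat['dog']
cat = catdata_mat['cat']
CD = np.concatenate((dog,cat),axis=1)
u,s,vT = np.linalg.svd(CD-np.mean(CD),full_matrices=0)
v = vT.T  # columns of v hold the loadings, as in the MATLAB listing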


The original image space, or pixel space, is only one potential set of data to 
work with. The data can be transformed into a wavelet representation where 
edges of the images are emphasized. The following code loads the images in 
their wavelet representation and computes a new low-rank embedding space. 


Code 5.3: [MATLAB] Wavelet features of dogs and cats. 


load catData_w.mat
load dogData_w.mat
CD2=[dog_wave cat_wave];
[u2,s2,v2]=svd(CD2-mean(CD2(:)),'econ');


Figure 5.4: First four features (a)-(d) generated from the SVD of the 160 images
of dogs and cats in the wavelet domain. As before, the first two modes (a) and
(b) show that the triangular ears are important. This is an alternative
representation of the dogs and cats that can help better classify dogs versus cats.


Code 5.3: [Python] Wavelet features of dogs and cats. 
dog_wave = dogdata_w_mat['dog_wave']
cat_wave = catdata_w_mat['cat_wave']
CD2 = np.concatenate((dog_wave,cat_wave),axis=1)
u2,s2,vT2 = np.linalg.svd(CD2-np.mean(CD2),full_matrices=0)
v2 = vT2.T  # columns of v2 hold the loadings


The equivalent of Fig. 5.3 in wavelet space is shown in Fig. 5.4. Note that the
wavelet representation helps emphasize many key features such as the eyes, 
nose, and ears, potentially making it easier to make a classification decision. 
Generating a feature space that enables classification is critical for constructing 
effective machine learning algorithms. 

Figure 5.5: Histogram of the distribution of loadings for dogs (blue) and cats
(red) on the first four dominant SVD modes. The left panels show the
distributions for the raw images (see Fig. 5.3), while the right panels show the
distribution for wavelet-transformed data (see Fig. 5.4). The loadings come from the
columns of the V matrix of the SVD. Note the good separability between dogs
and cats using the second mode.


Whether using the image space directly or a wavelet representation, Figs. 5.3
and 5.4, respectively, the goal is to project the data onto the feature space
generated by each. A good feature space helps find distinguishing features that
allow one to perform a variety of tasks that may include clustering,
classification, and prediction. The importance of each feature to an individual image is
given by the V matrix in the SVD. Specifically, the jth column of V gives
the loading, or weighting, of the jth feature onto each image. Histograms of
these loadings can then be used to visualize how distinguishable cats and dogs
are from each other by each feature (see Fig. 5.5). The following code produces
a histogram of the distribution of loadings for the dogs and the cats (first 80
images versus second 80 images, respectively).


Code 5.4: [MATLAB] Feature histograms of dogs and cats. 


xbin=linspace(-0.25,0.25,20);
for j=1:4
    subplot(4,2,2*j-1)
    pdf1=hist(v(1:80,j),xbin)
    pdf2=hist(v(81:160,j),xbin)
    plot(xbin,pdf1,xbin,pdf2,'Linewidth',[2])
end


Code 5.4: [Python] Feature histograms of dogs and cats. 


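# j is the mode index and xbin_edges holds the histogram bin edges
# (both assumed to be defined in the surrounding loop, which is not shown).
pdf1 = np.histogram(v[:80,j],bins=xbin_edges)[0]
pdf2 = np.histogram(v[80:,j],bins=xbin_edges)[0]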

[Panels: Raw images | Wavelet images]

Figure 5.6: Projection of dogs (green) and cats (magenta) into feature space.
Note that the raw images and their wavelet counterparts produce different
embeddings of the data. Both exhibit clustering around their labeled states of dogs
and cats. This is exploited in the learning algorithms that follow. The wavelet 
images are especially good for clustering and classification, as this feature space 
more easily separates the data. 


Figure 5.5 shows the distribution of loading scores for the first four modes
for both the raw images as well as the wavelet-transformed images. For both 
the sets of images, the distribution of loadings on the second mode clearly 
shows a strong separability between dogs and cats. The wavelet-processed im- 
ages also show a nice separability on the fourth mode. Note that the first mode 
for both shows very little discrimination between the distributions and is thus 
not useful for classification and clustering objectives. 
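
One simple way to quantify the separability seen in such histograms is a Fisher-type score: the squared difference of the class means divided by the sum of the class variances. This score is not used in the text; it is introduced here only as an illustrative assumption, and the loadings below are synthetic rather than taken from the dog and cat data.

```python
import numpy as np

def separability(a, b):
    """Fisher-type score: (difference of means)^2 / (sum of variances)."""
    return (np.mean(a) - np.mean(b)) ** 2 / (np.var(a) + np.var(b))

rng = np.random.default_rng(2)
n_per_class, n_modes = 80, 4

# Synthetic loadings: rows are images, columns are modes (like V in the SVD).
# Only the second mode (column index 1) is given a class-dependent mean.
v = 0.05 * rng.standard_normal((2 * n_per_class, n_modes))
v[:n_per_class, 1] += 0.08     # first 80 rows play the role of "dogs"
v[n_per_class:, 1] -= 0.08     # last 80 rows play the role of "cats"

scores = [separability(v[:n_per_class, j], v[n_per_class:, j])
          for j in range(n_modes)]
best_mode = int(np.argmax(scores)) + 1   # 1-based mode number
print(np.round(scores, 2), best_mode)
```

A mode whose two class histograms overlap almost completely receives a score near zero, while a mode with well-separated histograms receives a large score, so ranking modes by this number gives a rough guide to which features are worth passing to a classifier.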

Features that provide strong separability between different types of data 
(e.g., dogs and cats) are typically exploited for machine learning tasks. This 
simple example shows that feature engineering is a process whereby an ini- 
tial data exploration is used to help identify potential pre-processing methods. 
These features can then help the computer identify highly distinguishable fea- 
tures in a higher-dimensional space for accurate clustering, classification, and 
prediction. As a final note, consider Fig. 5.6, which projects the dogs and cats
data onto the first three PCA modes (SVD modes) discovered from the raw 
images or their wavelet-transformed counterparts. As will be seen later, the 
wavelet-transformed images provide a higher degree of separability, and thus 
improved classification. 


5.2 Supervised versus Unsupervised Learning 


As previously stated, the goal of data mining and machine learning is to con- 
struct and exploit the intrinsic low-rank feature space of a given data set. Good 
feature engineering and feature extraction algorithms can then be used to learn 
classifiers and predictors for the data. Two dominant paradigms exist for
learning from data: supervised methods and unsupervised methods. Supervised
data-mining algorithms are presented with labeled data sets, where the training data
is labeled by a teacher/expert/supervisor. Thus examples of the input and out- 
put of a desired model are explicitly given, and regression methods are used 
to find the best model via optimization for the given labeled data. This model 
is then used for prediction and classification using new data. There are impor- 
tant variants of this basic architecture which include semi-supervised learning, 
active learning, and reinforcement learning. For unsupervised learning algo- 
rithms, no training labels are given, so that an algorithm must find patterns in 
the data in a principled way in order to determine how to cluster and classify 
new data. In unsupervised learning, the goal itself may be to discover patterns 
in the data embedded in the low-rank subspaces so that feature engineering or 
feature extraction can be used to build an appropriate model. 

To illustrate the difference in supervised versus unsupervised learning,
consider Fig. 5.7. This shows a scatter plot of two Gaussian distributions. In one
case, the data are well separated so that their means are sufficiently far apart 
and two distinct clusters are observed. In the second case, the two distributions 
are brought close together so that separating the data is a challenging task. The 
goal of unsupervised learning is to discover clusters in the data. This is a trivial 
task by visual inspection, provided the two distributions are sufficiently sep- 
arated. Otherwise, it becomes very difficult to distinguish clusters in the data. 
Supervised learning provides labels for some of the data. In this case, points 
are labeled with either green dots or magenta dots and the task is to classify 
the unlabeled data (grey dots) as either green or magenta. Much like the unsu- 
pervised architecture, if the statistical distributions that produced the data are 
well separated, then using the labels in combination with the data provides a 
simple way to classify all the unlabeled data points. Supervised algorithms also 
perform poorly if the data distributions have significant overlap. 

Supervised and unsupervised learning can be stated mathematically. Let 


$$D \subset \mathbb{R}^n, \tag{5.1}$$

so that D is an open bounded set of dimension n. Further, let

$$D' \subset D. \tag{5.2}$$


The goal of classification is to build a classifier labeling all data in D given data
from D’.

To make our problem statement more precise, consider a set of data points 
$\mathbf{x}_j \in \mathbb{R}^n$ and labels $y_j$ for each point, where j = 1, 2, ..., m. Labels for the data
can come in many forms, from numeric values, including integer labels, to text
strings. For simplicity, we will label the data in a binary way as either plus one
or minus one, so that $y_j \in \{\pm 1\}$.


Figure 5.7: Illustration of unsupervised versus supervised learning. In panels 
(a) and (c), unsupervised learning attempts to find clusters for the data in or- 
der to classify them into two groups. For well-separated data (a), the task is 
straightforward and labels can easily be produced. For overlapping data (c), it 
is a very difficult task for an unsupervised algorithm to accomplish. In panels 
(b) and (d), supervised learning provides a number of labels: green balls and 
magenta balls. The remaining unlabeled data is then classified as green or
magenta. For well-separated data (b), labeling data is easy, while overlapping data
presents a significant challenge.


For unsupervised learning, the following inputs and outputs are then asso- 
ciated with learning a classification task: 


Input

$$\text{data } \{\mathbf{x}_j \in \mathbb{R}^n,\ j \in Z := \{1, 2, \ldots, m\}\}, \tag{5.3a}$$

Output

$$\text{labels } \{y_j \in \{\pm 1\},\ j \in Z\}. \tag{5.3b}$$


Thus the mathematical framing of unsupervised learning is focused on
producing labels $y_j$ for all the data. Generally, the data $\mathbf{x}_j$ used for training the
classifier is from D’. The classifier is then more broadly applied, i.e., it general- 
izes, to the open bounded domain D. If the data used to build a classifier only 
samples a small portion of the larger domain, then it is often the case that the 
classifier will not generalize well. 

Supervised learning provides labels for the training stage. The inputs and
outputs for this learning classification task can be stated as follows:


Input

$$\text{data } \{\mathbf{x}_j \in \mathbb{R}^n,\ j \in Z := \{1, 2, \ldots, m\}\}, \tag{5.4a}$$

$$\text{labels } \{y_j \in \{\pm 1\},\ j \in Z' \subset Z\}, \tag{5.4b}$$

Output

$$\text{labels } \{y_j \in \{\pm 1\},\ j \in Z\}. \tag{5.4c}$$


In this case, a subset of the data is labeled and the missing labels are provided 
for the remaining data. Technically speaking, this is a semi-supervised learning 
task, since some of the training labels are missing. For supervised learning, 
all the labels are known in order to build the classifier on D’. The classifier is 
then applied to D. As with unsupervised learning, if the data used to build a 
classifier only samples a small portion of the larger domain, then it is often the 
case that the classifier will not generalize well. 
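
The input/output structure in (5.4) can be illustrated with a deliberately simple classifier. The sketch below draws two Gaussian clouds in the plane, reveals labels only on a small subset Z' of the points, and assigns every point to the nearest labeled class mean. The nearest-centroid rule, the cluster locations, and the size of the labeled subset are all assumptions chosen for illustration, not a method prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200                                   # total number of data points

# Data x_j in R^2 drawn from two well-separated Gaussian clouds.
y_true = np.where(np.arange(m) < m // 2, 1, -1)
X = rng.standard_normal((m, 2)) + 3.0 * y_true[:, None]

# Labels are revealed only on a subset Z' of the index set Z = {0, ..., m-1}.
# Ten indices are drawn from each cloud so that both classes are represented.
labeled = np.concatenate([rng.choice(m // 2, size=10, replace=False),
                          m // 2 + rng.choice(m // 2, size=10, replace=False)])

# Nearest-centroid rule built from the labeled subset only.
mu_plus = X[labeled][y_true[labeled] == 1].mean(axis=0)
mu_minus = X[labeled][y_true[labeled] == -1].mean(axis=0)
d_plus = np.linalg.norm(X - mu_plus, axis=1)
d_minus = np.linalg.norm(X - mu_minus, axis=1)
y_pred = np.where(d_plus < d_minus, 1, -1)   # labels for every j in Z

accuracy = np.mean(y_pred == y_true)
print(accuracy)
```

Reducing the separation factor `3.0` toward zero makes the two clouds overlap, and the accuracy drops toward chance, mirroring the overlapping case in Fig. 5.7.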

For the data sets considered in our feature selection and data-mining sec- 
tion, we can consider in more detail the key components required to build a 
classification model: $\mathbf{x}_j$, $y_j$, D, and D’. The Fisher iris data of Fig. 5.1 is a
classic example for which we can detail these quantities. We begin with the data
collected: 


$$\mathbf{x}_j = \{\text{sepal length, sepal width, petal length, petal width}\}. \tag{5.5}$$


Thus each iris measurement contains four data fields, or features, for our anal- 
ysis. The labels can be one of the following: 


$$y_j = \{\text{setosa, versicolor, virginica}\}. \tag{5.6}$$


In this case the labels are text strings, and there are three of them. Note that, in 
our formulation of supervised and unsupervised learning, there were only two 
outputs (binary), which were labeled either ±1. Generally, there can be many
labels, and they are often text strings. Finally, there is the domain of the data. 
For this case, 


$$D' \in \{\text{150 iris samples: 50 setosa, 50 versicolor, and 50 virginica}\} \tag{5.7}$$


and

$$D \in \{\text{the universe of setosa, versicolor, and virginica irises}\}. \tag{5.8}$$


Figure 5.8: Classification and regression models for data can be difficult when
the data have nonlinear functions which separate them. In this case, the
function separating the green and magenta balls can be difficult to extract.
Moreover, if only a small sample of the data D’ is available, then a generalizable
model may be impossible to construct for D. The data set in panel (a)
represents two half-moon shapes that are just superimposed, while the concentric
rings in panel (b) require a circle as a separation boundary between the data.
Both are challenging to produce.


We can similarly assess the dogs and cats data as follows:

$$\mathbf{x}_j = \{64 \times 64 \text{ image} = 4096 \text{ pixels}\}, \tag{5.9}$$

where each dog and cat is labeled as

$$y_j = \{\text{dog}, \text{cat}\} = \{1, -1\}. \tag{5.10}$$


In this case the labels are text strings, which can also be translated to numeric 
values. This is consistent with our formulation of supervised and unsupervised 
learning, where there are only two outputs (binary), labeled either ±1. Finally,
there is the domain of the data, which is 


$$D' \in \{\text{160 image samples: 80 dogs and 80 cats}\} \tag{5.11}$$


and

$$D \in \{\text{the universe of dogs and cats}\}. \tag{5.12}$$


Supervised and unsupervised learning methods aim to create algorithms 
for classification, clustering, or regression. The discussion above is a general 
strategy for classification. The previous chapter discusses regression architec- 
tures. For both tasks, the goal is to build a model from data on D’ that can gen- 
eralize to D. As already shown in the preceding chapter on regression, gener- 
alization can be very difficult, and cross-validation strategies are critical. Deep 
neural networks, which are state-of-the-art machine learning algorithms for re- 
gression and classification, often have difficulty generalizing. Creating strong 
generalization schemes is at the forefront of machine learning research. 

Some of the difficulties in generalization are illustrated in Fig. 5.8. These
data sets, although easily classified and clustered through visual inspection, 
can be difficult for many regression and classification schemes. Essentially, the 
boundary between the data forms a nonlinear manifold that is often difficult
to characterize. Moreover, if the sampling data D’ only captures a portion of
the manifold, then a classification or regression model will almost surely fail 
in characterizing D. These are also only two-dimensional depictions of a classi- 
fication problem. It is not difficult to imagine how complicated such data em- 
beddings can be in higher-dimensional space. Visualization in such cases is 
essentially impossible and one must rely on algorithms to extract the mean- 
ingful boundaries separating data. What follows in this chapter and the next 
are methods for classification and regression given data on D’ that may or may 
not be labeled. There is quite a diversity of mathematical methods available for 
performing such tasks. 


5.3 Unsupervised Learning: k-Means Clustering 


A variety of supervised and unsupervised algorithms will be highlighted in 
this chapter. We will start with one of the most prominent unsupervised algo- 
rithms in use today: k-means clustering. The k-means algorithm assumes one is 
given a set of vector-valued data with the goal of partitioning m observations 
into k clusters. Each observation is labeled as belonging to a cluster with the 
nearest mean, which serves as a proxy (prototype) for that cluster. This results 
in a partitioning of the data space into Voronoi cells. 

Although the number of observations and the dimension of the system are 
known, the number of partitions k is generally unknown and must also be de- 
termined. Alternatively, the user simply chooses a number of clusters to extract 
from the data. The k-means algorithm is iterative, first assuming initial values 
for the mean of each cluster and then updating the means until the algorithm 
has converged. Figure 5.9 depicts the update rule of the k-means algorithm.
The algorithm proceeds as follows: (i) Given initial values for k distinct means,
compute the distance of each observation $\mathbf{x}_j$ to each of the k means. (ii) Label
each observation as belonging to the nearest mean. (iii) Once labeling is com- 
pleted, find the center-of-mass (mean) for each group of labeled points. These 
new means are then used to start back at step (i) in the algorithm. This is a 
heuristic algorithm that was first proposed by Stuart Lloyd in 1957 [452], al- 
though it was not published until 1982. 

The k-means objective can be stated formally in terms of an optimization 
problem. Specifically, the following minimization describes this process: 


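$$\underset{\boldsymbol{\mu}_j}{\operatorname{argmin}} \sum_{j=1}^{k} \sum_{\mathbf{x}_n \in \mathcal{D}'_j} \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2, \tag{5.13}$$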


where $\boldsymbol{\mu}_j$ denotes the mean of the jth cluster and $\mathcal{D}'_j$ denotes the subdomain


of data associated with that cluster. This minimizes the within-cluster sum of

Figure 5.9: Illustration of the k-means algorithm for k = 2. Two initial starting
values of the mean are given (black +). Each point is labeled as belonging to one 
of the two means. The green balls are thus labeled as part of the cluster with the 
left + and the magenta balls are labeled as part of the right +. Once labeled, the 
mean of the two clusters is recomputed (red +). The process is repeated until 
the means converge. 


squares. In general, solving the optimization problem as stated is NP-hard, 
making it computationally intractable. However, there are a number of heuristic
algorithms that provide good performance despite not having a guarantee that 
they will converge to the globally optimal solution. 
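
As a minimal sketch of the objective in (5.13), the following NumPy snippet evaluates the within-cluster sum of squares and checks that one update of the means followed by relabeling does not increase it. The helper names `assign_labels` and `wcss`, and the synthetic two-cluster data, are assumptions made for illustration and are not part of the codes in this chapter.

```python
import numpy as np

def assign_labels(Y, means):
    """Label each row of Y with the index of its nearest mean (Euclidean)."""
    # dists[n, j] = ||Y[n] - means[j]||
    dists = np.linalg.norm(Y[:, None, :] - means[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def wcss(Y, means, labels):
    """Within-cluster sum of squares, the quantity minimized in (5.13)."""
    return float(np.sum((Y - means[labels]) ** 2))

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumed): two Gaussian clusters of 50 points each
Y = np.vstack([rng.normal([-1.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([1.5, 0.5], 0.5, size=(50, 2))])

means = np.array([[-1.0, 0.0], [1.0, 0.0]])    # initial guess
labels = assign_labels(Y, means)
before = wcss(Y, means, labels)

# One heuristic update: recompute each mean from its labeled points, then relabel
means = np.array([Y[labels == j].mean(axis=0) for j in range(len(means))])
labels = assign_labels(Y, means)
after = wcss(Y, means, labels)
print(before, after)
```

Each of the two stages can only lower (or leave unchanged) the sum of squares, which is why the iteration settles down, although only to a local minimum in general.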

Cross-validation of the k-means algorithm, as well as any machine learning 
algorithm, is critical for determining its effectiveness. Without labels, the cross- 
validation procedure is more nuanced, as there is no ground truth to compare 
with. The cross-validation methods of the last section, however, can still be 
used to test the robustness of the classifier to different sub-selections of the 
data through k-fold cross-validation. The following portions of code implement
Lloyd’s algorithm for k-means clustering. We first consider making two clusters
of data and partitioning the data into a training set and a test set.


Code 5.5: [MATLAB] The Lloyd algorithm for k-means. 
g1=[-1 0]; g2=[1 0]; % Initial guess
for j=1:4
    class1=[]; class2=[];
    for jj=1:length(Y)
        d1=norm(g1-Y(jj,:));
        d2=norm(g2-Y(jj,:));
        if d1<d2
            class1=[class1; [Y(jj,1) Y(jj,2)]];
        else
            class2=[class2; [Y(jj,1) Y(jj,2)]];
        end
    end
    g1=[mean(class1(1:end,1)) mean(class1(1:end,2))];
    g2=[mean(class2(1:end,1)) mean(class2(1:end,2))];
end


Code 5.5: [Python] The Lloyd algorithm for k-means. 


import numpy as np

# Y is the (m x 2) array of data points to be clustered
g1 = np.array([-1,0]); g2 = np.array([1,0]) # Initial guess
for j in range(4):
    class1 = np.zeros((1,2))
    class2 = np.zeros((1,2))
    for jj in range(Y.shape[0]):
        d1 = np.linalg.norm(g1-Y[jj,:],ord=2)
        d2 = np.linalg.norm(g2-Y[jj,:],ord=2)
        if d1<d2:
            class1 = np.append(class1,Y[jj,:].reshape((1,2)),axis=0)
        else:
            class2 = np.append(class2,Y[jj,:].reshape((1,2)),axis=0)
    class1 = np.delete(class1,(0),axis=0) # remove the initial row of zeros
    class2 = np.delete(class2,(0),axis=0)
    # Update the means, mirroring the last step of the MATLAB version
    g1 = np.mean(class1,axis=0)
    g2 = np.mean(class2,axis=0)


Figures 5.10 and 5.11 show the data generated from two distinct Gaussian
distributions. In this case, we have ground-truth data to check the k-means 
clustering against. In general, this is not the case. The Lloyd algorithm guesses 
the number of clusters and the initial cluster means, and then proceeds to up- 
date them in an iterative fashion. Thus, k-means is sensitive to the initial guess 
and many modern versions of the algorithm also provide principled strategies 
for initialization. 
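
One widely used seeding strategy of this kind is k-means++, named here only as an example; it is not the initialization used in the codes of this section. The sketch below, with the hypothetical helper `kmeanspp_init` and assumed synthetic data, picks the first mean uniformly at random from the data and each later mean with probability proportional to its squared distance from the means already chosen.

```python
import numpy as np

def kmeanspp_init(Y, k, rng):
    """Choose k initial means from the rows of Y with k-means++ style seeding."""
    m = Y.shape[0]
    centers = [Y[rng.integers(m)]]                # first mean: uniform at random
    for _ in range(1, k):
        # squared distance from every point to its nearest chosen mean
        d2 = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        # points far from all chosen means are more likely to be picked
        centers.append(Y[rng.choice(m, p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumed): two Gaussian clusters of 50 points each
Y = np.vstack([rng.normal([-1.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([1.5, 0.5], 0.5, size=(50, 2))])

g = kmeanspp_init(Y, 2, rng)
g1, g2 = g[0], g[1]        # could replace the fixed initial guesses above
print(g1, g2)
```

Because an already chosen point has zero distance to itself, it cannot be drawn twice, so the initial means are distinct whenever the data points are.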

Figure 5.10 shows the iterative procedure of the k-means clustering. The
two initial guesses are used to initially label all the data points (Fig. 5.10(a)).
New means are computed and the data relabeled. After only four iterations, 
the clusters converge. This algorithm was explicitly developed here to show 
how the iteration procedure rapidly provides an unsupervised labeling of all 
of the data. MATLAB has a built-in k-means algorithm that only requires a data 
matrix and the number of clusters desired. It is simple to use and provides a 
valuable diagnostic tool for data. The following code uses the MATLAB com- 
mand kmeans and also extracts the decision line generated from the algorithm 
separating the two clusters. 


Code 5.6: [MATLAB] k-means using MATLAB. 


[ind,c]=kmeans(Y,2);

Figure 5.10: Illustration of the k-means iteration procedure based upon Lloyd’s
algorithm [452]. Two clusters are sought so that k = 2. The initial guesses (black 
circles in panel (a)) are used to initially label all the data according to their dis- 
tance from each initial guess for the mean. The means are then updated by 
computing the means of the newly labeled data. This two-stage heuristic con- 
verges after approximately four iterations. 


Code 5.6: [Python] k-means using Python. 


from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(Y)


Figure 5.11 shows the results of the k-means algorithm and depicts the decision
line separating the data into two clusters. The green and magenta balls
denote the true labels of the data, showing that the k-means line does not cor- 
rectly extract the labels. Indeed, a supervised algorithm is more proficient in 
extracting the ground-truth results, as will be shown later in this chapter. Re- 
gardless, the algorithm does get a majority of the data labeled correctly. 

The success of k-means is based on two factors: (i) no supervision is re- 
quired, and (ii) it is a fast heuristic algorithm. The example here shows that the 
method is not very accurate, but this is often the case in unsupervised methods, 
as the algorithm has limited knowledge of the data. Cross-validation efforts, 
such as k-fold cross-validation, can help improve the model and make the un- 
supervised learning more accurate, but it will generally be less accurate than a 
supervised algorithm that has labeled data.

[Figure 5.11 panels: Training data; Test data]
Figure 5.11: The k-means clustering of the data using MATLAB’s kmeans com- 
mand. Only the data and number of clusters need be specified. (a) The training 
data is used to produce a decision line (black line) separating the clusters. Note 
that the line is clearly not optimal. The classification line can then be used on 
withheld data to test the accuracy of the algorithm. For the test data, one ma- 
genta ball (of 50) would be mislabeled, while six green balls (of 50) are misla- 
beled. 


5.4 Unsupervised Hierarchical Clustering: Dendrogram


Another commonly used unsupervised algorithm for clustering data is a den- 
drogram. Like k-means clustering, dendrograms are created from a simple hi- 
erarchical algorithm, allowing one to efficiently visualize if data is clustered 
without any labeling or supervision. This hierarchical approach will be applied 
to the data illustrated in Fig. 5.12, where a ground truth is known. Hierarchical
clustering methods are generated from either a top-down or a bottom-up
approach. Specifically, they are one of two types: 


Agglomerative. Each data point $\mathbf{x}_j$ is its own cluster initially. The data is merged
in pairs as one creates a hierarchy of clusters. The merging of data eventually 
stops once all the data has been merged into a single über cluster. This is the 
bottom-up approach in hierarchical clustering. 


Divisive. In this case, all the observations $\mathbf{x}_j$ are initially part of a single giant
cluster. The data is then recursively split into smaller and smaller clusters. The 
splitting continues until the algorithm stops according to a user-specified ob- 
jective. The divisive method can split the data until each data point is its own 
node. 


In general, the merging and splitting of data is accomplished with a heuris- 
tic, greedy algorithm, which is easy to execute computationally. The results of 
hierarchical clustering are usually presented in a dendrogram.

Figure 5.12: Example data used for construction of a dendrogram. The data is
constructed from two Gaussian distributions (50 points each) that are easy to 
discern through a visual inspection. The dendrogram will produce a hierarchy 
that ideally would separate green balls from magenta balls. 


In this section, we will focus on agglomerative hierarchical clustering and 
the dendrogram command from MATLAB. Like the Lloyd algorithm for k- 
means clustering, building the dendrogram proceeds from a simple algorith- 
mic structure based on computing the distance between data points. Although 
we typically use a Euclidean distance, there are a number of important distance 
metrics one might consider for different types of data. Some typical distances 
are given as follows: 


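$$\text{Euclidean distance} \qquad \|\mathbf{x}_j - \mathbf{x}_k\|_2, \tag{5.14a}$$
$$\text{squared Euclidean distance} \qquad \|\mathbf{x}_j - \mathbf{x}_k\|_2^2, \tag{5.14b}$$
$$\text{Manhattan distance} \qquad \|\mathbf{x}_j - \mathbf{x}_k\|_1, \tag{5.14c}$$
$$\text{maximum distance} \qquad \|\mathbf{x}_j - \mathbf{x}_k\|_\infty, \tag{5.14d}$$
$$\text{Mahalanobis distance} \qquad \sqrt{(\mathbf{x}_j - \mathbf{x}_k)^T \mathbf{C}^{-1} (\mathbf{x}_j - \mathbf{x}_k)}, \tag{5.14e}$$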


where $\mathbf{C}$ is the covariance matrix. As already illustrated in the previous chapter,
the choice of norm can make a tremendous difference for exposing patterns
in the data that can be exploited for clustering and classification. 
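
To make the distances in (5.14) concrete, the short sketch below evaluates all five for one pair of vectors. The function name `distances`, the example vectors, and the diagonal covariance matrix are assumptions chosen so that the arithmetic is easy to check by hand.

```python
import numpy as np

def distances(xj, xk, C):
    """Return the five distances of (5.14) between vectors xj and xk.
    C is the covariance matrix used by the Mahalanobis distance."""
    d = xj - xk
    return {
        "euclidean": np.linalg.norm(d, 2),
        "sqeuclidean": np.linalg.norm(d, 2) ** 2,
        "manhattan": np.linalg.norm(d, 1),
        "maximum": np.linalg.norm(d, np.inf),
        # solve(C, d) applies the inverse covariance without forming it
        "mahalanobis": np.sqrt(d @ np.linalg.solve(C, d)),
    }

xj = np.array([1.0, 2.0])
xk = np.array([4.0, -2.0])
C = np.diag([9.0, 16.0])       # assumed covariance: variances 9 and 16
out = distances(xj, xk, C)
print(out)
```

Here the difference vector is (-3, 4), so the Euclidean, squared Euclidean, Manhattan, and maximum distances are 5, 25, 7, and 4. The Mahalanobis distance rescales each coordinate by its standard deviation and is approximately 1.414.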

The dendrogram algorithm is shown in Fig. 5.13. The algorithm is as follows:
(i) Compute the distance between all m data points $\mathbf{x}_j$ (Fig. 5.13 illustrates
the use of a Euclidean distance). (ii) Merge the closest two data points
into a single new data point midway between their original locations. (iii) Repeat
the calculation with the new m - 1 points. The algorithm continues until
peat the calculation with the new m — 1 points. The algorithm continues until 
the data has been hierarchically merged into a single data point.

Figure 5.13: Illustration of the agglomerative hierarchical clustering scheme applied
to four data points. In the algorithm, the distances between the four data
points are computed. Initially, the Euclidean distance between points 2 and 
3 is the least. Points 2 and 3 are thus merged into a point midway between 
them, and the distances are once again computed. The dendrogram on the right 
shows how the process generates a summary (dendrogram) of the hierarchical 
clustering. Note that the length of the branches of the dendrogram tree are di- 
rectly related to the distance between the merged points. 
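
The three steps above can be sketched directly for four points. This is a minimal illustration of the midpoint-merging description in the text, with a hypothetical helper `merge_sequence` and assumed coordinates; library routines such as linkage apply other merge rules (for example, average linkage), so their merge distances will generally differ.

```python
import numpy as np

def merge_sequence(X):
    """Repeatedly merge the two closest points into their midpoint.
    Returns a list of (label_a, label_b, distance) tuples, one per merge."""
    points = {i: X[i].astype(float) for i in range(len(X))}  # active points by label
    next_label = len(X)
    merges = []
    while len(points) > 1:
        labels = list(points)
        best = None
        # (i) compute all pairwise Euclidean distances and keep the smallest
        for a in range(len(labels)):
            for b in range(a + 1, len(labels)):
                dist = np.linalg.norm(points[labels[a]] - points[labels[b]])
                if best is None or dist < best[2]:
                    best = (labels[a], labels[b], dist)
        la, lb, dist = best
        # (ii) replace the closest pair by a new point midway between them
        midpoint = 0.5 * (points[la] + points[lb])
        del points[la], points[lb]
        points[next_label] = midpoint
        merges.append((la, lb, float(dist)))
        next_label += 1          # (iii) repeat with one fewer point
    return merges

# Four assumed points on a line, labeled 0 to 3
X = np.array([[0.0, 0.0], [2.0, 0.0], [2.5, 0.0], [6.0, 0.0]])
merges = merge_sequence(X)
print(merges)
```

With zero-based labels, the merges are (1, 2) at distance 0.5, then (0, 4) at distance 2.25, and finally (3, 5) at distance 4.875, where labels 4 and 5 denote the merged points. These distances play the role of the branch lengths in the dendrogram.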


The following code performs a hierarchical clustering using the dendrogram 
command from MATLAB. The example we use is the same as that considered 
for k-means clustering. Figure|5.12|shows the data under consideration. Visual 
inspection shows two clear clusters that are easily discernible. As with k-means, 
our goal is to see how well a dendrogram can extract the two clusters. 


Code 5.7: [MATLAB] Dendrogram for unsupervised clustering. 


% Y3: data matrix whose first 50 rows are the green points and next 50 rows the magenta points
Y2 = pdist(Y3,'euclidean');
Z = linkage(Y2,'average');
thresh=0.85*max(Z(:,3));
[H,T,O]=dendrogram(Z,100,'ColorThreshold',thresh);


Code 5.7: [Python] Dendrogram for unsupervised clustering. 
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
Y2 = pdist(Y3,metric='euclidean')
Z = hierarchy.linkage(Y2,method='average')
thresh = 0.85*np.max(Z[:,2])
dn = hierarchy.dendrogram(Z,p=100,color_threshold=thresh)

Figure 5.14: Dendrogram structure produced from the data in Fig. 5.12. The
dendrogram shows which points are merged as well as the distance between 
points. Two clusters are generated for this level of threshold. 


Figure 5.14 shows the dendrogram associated with the data in Fig. 5.12. The
structure of the algorithm shows which points are merged as well as the dis- 
tance between points. The threshold command is important in labeling where 
each point belongs in the hierarchical scheme. By setting the threshold at differ- 
ent levels, there can be more or fewer clusters in the dendrogram. The output 
of the dendrogram is used to show how the data was labeled. Recall that the 
first 50 data points are from the green cluster and the second 50 data points are 
from the magenta cluster. 

Figure 5.15 shows how the data was clustered in the dendrogram. If perfect
clustering had been achieved, then the first 50 points would have been
below the horizontal dotted red line while the second 50 points would have 
been above the horizontal dotted red line. The vertical dotted red line is the 
line separating the green dots on the left from the magenta dots on the right. 

A greater number of clusters are generated by adjusting the threshold in 
the dendrogram command. This is equivalent to setting the number of clus- 
ters in k-means to something greater than two. Recall that one rarely has a 
ground truth to compare with when doing unsupervised clustering, so tuning 
the threshold becomes important. 
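
As a hedged sketch of how the threshold controls the number of clusters, the snippet below cuts the same linkage tree at two heights using SciPy's `fcluster` with the `distance` criterion. The synthetic data and the second fraction (0.25) are assumptions for illustration; the flat labels produced this way are analogous to, but not taken from, the coloring in the dendrogram command.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Synthetic stand-in data (assumed): two well-separated Gaussian clusters
Y3 = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
                rng.normal([4.0, 4.0], 0.5, size=(50, 2))])
Z = hierarchy.linkage(pdist(Y3, metric='euclidean'), method='average')

counts = {}
for frac in (0.85, 0.25):
    thresh = frac * np.max(Z[:, 2])
    # points whose merges all occur below thresh share a flat cluster label
    labels = hierarchy.fcluster(Z, t=thresh, criterion='distance')
    counts[frac] = len(np.unique(labels))
print(counts)
```

Lowering the threshold can only split clusters further, so the cluster count at the smaller fraction is never less than at the larger one.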

Figure 5.16 shows a new dendrogram with a different threshold. Note that

Figure 5.15: Clustering outcome from dendrogram routine. This is a summary
of Fig. 5.14, showing how each of the points was clustered through the distance
metric. The horizontal red dotted line shows where the ideal separation
should occur. The first 50 points (green dots of Fig. 5.12) should be grouped
so that they are below the red horizontal line in the lower left quadrant. The
second 50 points (magenta dots of Fig. 5.12) should be grouped above the red
horizontal line in the upper right quadrant. In summary, the dendrogram only
misclassified two green points and two magenta points.


in this case, the hierarchical clustering produces more than a dozen clusters. 
The tuning parameter can be seen to be critical for unsupervised clustering, 
much like choosing the number of clusters in k-means. In summary, both k- 
means and hierarchical clustering provide a method whereby data can be parsed 
automatically into clusters. This provides a starting point for interpretations 
and analysis in data mining. 


5.5 Mixture Models and the Expectation-Maximization 
Algorithm 

The third unsupervised method we consider is known as finite mixture models. 

Often the models are assumed to be Gaussian distributions, in which case this 


method is known as Gaussian mixture models (GMM). The basic assumption in 
this method is that data observations $\mathbf{x}_j$ are a mixture of a set of k processes

Figure 5.16: Dendrogram structure produced from the data in Fig. 5.12 with a
different threshold used than in Fig. 5.14. The dendrogram shows which points
are merged as well as the distance between points. In this case, more than a 
dozen clusters are generated. 


that combine to form the measurement. Like k-means and hierarchical cluster- 
ing, the GMM model we fit to the data requires that we specify the number of 
mixtures k and the individual statistical properties of each mixture that best fit 
the data. GMMs are especially useful since the assumption that each mixture 
model has a Gaussian distribution implies that it can be completely character- 
ized by two parameters: the mean and the variance. 

The algorithm that enables the GMM computes the maximum-likelihood estimate
using the famous expectation-maximization (EM) algorithm of Dempster, Laird,
and Rubin [200]. The EM algorithm is designed to find maximum-likelihood pa- 
rameters of statistical models. Likelihood is a fundamental concept of statistics 
and probability theory [453]. Although not covered here, it provides the 
mathematical construct for the EM algorithm and GMM. Generally, the iter- 
ative structure of the algorithm finds a local maximum likelihood, which esti- 
mates the true parameters that cannot be directly solved for. As with most data, 
the observed data involves many latent or unmeasured variables and unknown 
parameters. Regardless, the alternating and iterative construction of the algorithm
recursively estimates the best parameters possible from an initial guess.
The EM algorithm proceeds like the k-means algorithm in that initial guesses
for the mean and variance are given for the assumed k distributions. The algorithm
then recursively updates the weights of the mixtures versus the parameters
of each mixture. One alternates between these two until convergence is
achieved. 

In any such iteration scheme, it is not obvious that the solution will converge,
or that the solution is good, since it typically falls into a local maximum of
the likelihood. But it can be proven that in this context it does converge,
and that the derivative of the likelihood is arbitrarily close to zero at
that point, which in turn means that the point is either a maximum or a saddle 
point [763]. In general, multiple maxima may occur, with no guarantee that the 
global maximum will be found. Some likelihoods also have singularities, i.e., 
nonsensical maxima. For example, one of the solutions that may be found by 
EM in a mixture model involves setting one of the components to have zero 
variance and the mean equal to one of the data points. Cross-validation can 
often alleviate some of the common pitfalls that can occur by initializing the 
algorithm with some bad initial guesses. 

The fundamental assumption of the mixture model is that the probability 
density function (PDF) for observations of data $\mathbf{x}_j$ is a weighted linear sum of
a set of unknown distributions, 


$$f(\mathbf{x}_j, \Theta) = \sum_{p=1}^{k} \alpha_p f_p(\mathbf{x}_j, \Theta_p), \tag{5.15}$$


where $f(\cdot)$ is the measured PDF, $f_p(\cdot)$ is the PDF of the mixture p, and k is the
total number of mixtures. Each of the PDFs $f_p(\cdot)$ is weighted by $\alpha_p$ (with
$\alpha_1 + \alpha_2 + \cdots + \alpha_k = 1$) and parameterized by an unknown vector of parameters $\Theta_p$.
To state the objective of mixture models more precisely then: Given the observed
PDF $f(\mathbf{x}_j, \Theta)$, estimate the mixture weights $\alpha_p$ and the parameters of the distribution
$\Theta_p$. Note that $\Theta$ is a vector containing all the parameters $\Theta_p$. Making this task
somewhat easier is the fact that we assume the form of the PDF distribution
$f_p(\cdot)$.

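For GMM, the parameters in the vector $\Theta_p$ are known to include only two
variables: the mean $\boldsymbol{\mu}_p$ and variance $\sigma_p$. Moreover, the distribution $f_p(\cdot)$ is
normally distributed, so that (5.15) becomes

$$f(\mathbf{x}_j, \Theta) = \sum_{p=1}^{k} \alpha_p \mathcal{N}_p(\mathbf{x}_j, \boldsymbol{\mu}_p, \sigma_p). \tag{5.16}$$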


This gives a much more tractable framework since there is now a limited set 
of parameters. Thus, once one assumes a number of mixtures k, then the task
is to determine $\alpha_p$ along with $\boldsymbol{\mu}_p$ and $\sigma_p$ for each mixture. It should be noted
that there are many other distributions besides Gaussian that can be imposed,
but GMMs are common since, without prior knowledge, a Gaussian distribution
is typically assumed.

An estimate of the parameter vector $\Theta$ can be computed using the maximum-likelihood
estimate (MLE) of Fisher. The MLE computes the value of $\Theta$ from the
roots of

$$\frac{\partial L(\Theta)}{\partial \Theta} = 0, \tag{5.17}$$

where the log-likelihood function L is

$$L(\Theta) = \sum_{j=1}^{n} \log f(\mathbf{x}_j | \Theta) \tag{5.18}$$

and the sum is over all the n data vectors $\mathbf{x}_j$. The solution to this optimization
problem, i.e., when the derivative is zero, produces a local maximizer. This
maximizer can be computed using the EM algorithm since derivatives cannot 
be explicitly computed without an analytic form. 

The EM algorithm starts by assuming an initial estimate (guess) of the parameter
vector $\Theta$. This guess can be used to estimate

$$\tau_p(\mathbf{x}_j, \Theta) = \frac{\alpha_p f_p(\mathbf{x}_j, \Theta_p)}{f(\mathbf{x}_j, \Theta)}, \tag{5.19}$$

which is the posterior probability of component membership of $\mathbf{x}_j$ in the pth
distribution. In other words, does $\mathbf{x}_j$ belong to the pth mixture? The E step of
the EM algorithm uses this posterior to compute memberships. For GMM, the
algorithm proceeds as follows: Given an initial parameterization of $\Theta$ and $\alpha_p$,
compute

$$\tau_p^{(k)}(\mathbf{x}_j) = \frac{\alpha_p^{(k)} \mathcal{N}_p(\mathbf{x}_j, \boldsymbol{\mu}_p^{(k)}, \sigma_p^{(k)})}{f(\mathbf{x}_j, \Theta^{(k)})}. \tag{5.20}$$

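With an estimated posterior probability, the M step of the algorithm then updates
the parameters and mixture weights,

$$\alpha_p^{(k+1)} = \frac{1}{n} \sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j), \tag{5.21a}$$
$$\boldsymbol{\mu}_p^{(k+1)} = \frac{\sum_{j=1}^{n} \mathbf{x}_j \tau_p^{(k)}(\mathbf{x}_j)}{\sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j)}, \tag{5.21b}$$
$$\boldsymbol{\Sigma}_p^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j) \left[ (\mathbf{x}_j - \boldsymbol{\mu}_p^{(k+1)}) (\mathbf{x}_j - \boldsymbol{\mu}_p^{(k+1)})^T \right]}{\sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j)}, \tag{5.21c}$$

where the matrix $\boldsymbol{\Sigma}_p^{(k+1)}$ is the covariance matrix containing the variance
parameters. Here the superscript $(k)$ indexes the iteration, not the number of mixtures.
The E and M steps are alternated until convergence within a specified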
tolerance. Recall that, to initialize the algorithm, the number of mixture models 
k must be specified and initial parameterization (guesses) of the distributions 
given. This is similar to the k-means algorithm, where the number of clusters k 
is prescribed and an initial guess for the cluster centers is specified. 
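
The alternation between (5.20) and (5.21) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the helper names `gaussian_pdf` and `em_gmm`, the fixed iteration count in place of a convergence tolerance, and the synthetic data are all choices made here and are not the implementation behind fitgmdist or GaussianMixture. In the code, k denotes the number of mixtures.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T           # Sigma^{-1} (x - mu)
    expo = -0.5 * np.sum(diff * sol, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(expo) / norm

def em_gmm(X, alpha, mu, Sigma, n_iter=50):
    """Alternate the E step (5.20) and the M step (5.21) n_iter times."""
    n, k = X.shape[0], len(alpha)
    for _ in range(n_iter):
        # E step: tau[j, p] = alpha_p N_p(x_j) / f(x_j)
        weighted = np.column_stack([alpha[p] * gaussian_pdf(X, mu[p], Sigma[p])
                                    for p in range(k)])
        tau = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: update mixture weights, means, and covariances
        Nk = tau.sum(axis=0)
        alpha = Nk / n
        mu = (tau.T @ X) / Nk[:, None]
        Sigma = np.array([((tau[:, p, None] * (X - mu[p])).T @ (X - mu[p])) / Nk[p]
                          for p in range(k)])
    return alpha, mu, Sigma

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumed): 150 and 250 points from two Gaussians
X = np.vstack([rng.normal([-2.0, 0.0], 0.5, size=(150, 2)),
               rng.normal([2.0, 1.0], 0.5, size=(250, 2))])

alpha0 = np.array([0.5, 0.5])                        # initial guesses
mu0 = np.array([[-1.0, 0.0], [1.0, 0.0]])
Sigma0 = np.array([np.eye(2), np.eye(2)])
alpha, mu, Sigma = em_gmm(X, alpha0, mu0, Sigma0)
print(alpha, mu)
```

For well-separated data such as this, the estimated weights should land near the true proportions 0.375 and 0.625, and the means near the generating centers.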

The GMM is popular since it simply fits k Gaussian distributions to data, 
which is reasonable for unsupervised learning. The GMM algorithm also has 
a stronger theoretical base than most unsupervised methods, as both k-means 
and hierarchical clustering are simply defined as algorithms. The primary assumptions
in GMM are the number of clusters and the form of the distribution
$f(\cdot)$.

The following code fits a GMM to the second and fourth principal
components of the dogs and cats wavelet image data introduced previously
in this chapter. Thus the features are the second and fourth columns of
the matrix of right singular vectors from the SVD. The fitgmdist command is used to extract
the mixture model.


Code 5.8: [MATLAB] Gaussian mixture model for cats versus dogs. 
dogcat=v(:,2:2:4);
GMModel=fitgmdist(dogcat,2)
AIC=GMModel.AIC


Code 5.8: [Python] Gaussian mixture model for cats versus dogs. 


from sklearn.mixture import GaussianMixture
GMModel = GaussianMixture(n_components=2).fit(dogcat)
AIC = GMModel.aic(dogcat)


The results of the algorithm can be plotted for visual inspection, and the 
parameters associated with each Gaussian are given: specifically, the mixing 
proportion of each model along with the mean in each of the two dimensions 
of the feature space. The following is displayed to the screen. 


Component 1:
Mixing proportion: 0.355535
Mean: [illegible]

Component 2:
Mixing proportion: 0.644465
Mean: [illegible] 0.0076

AIC = [illegible]
Figure 5.17: GMM fit of the second and fourth principal components of the
dogs and cats wavelet image data. The two Gaussians are well placed over the 
distinct dogs and cats features as shown in (a). The PDF of the Gaussian models 
extracted are highlighted in (b) in arbitrary units. 


The code can also produce an AIC score for how well the mixture of Gaussians 
explains the data. This gives a principled method for cross-validating in order 
to determine the number of mixtures required to describe the data. 
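
One way to use the AIC score for this purpose is to fit mixtures with several values of k and compare the scores, with lower values preferred. The sketch below does this with scikit-learn's GaussianMixture on assumed synthetic data; the range of k and the data are illustrative choices, not results for the dogs and cats features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumed): two Gaussian clusters of 100 points each
X = np.vstack([rng.normal([-2.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([2.0, 1.0], 0.5, size=(100, 2))])

aic = {}
for k in range(1, 5):
    model = GaussianMixture(n_components=k, random_state=0).fit(X)
    aic[k] = model.aic(X)          # lower AIC indicates a better trade-off
best_k = min(aic, key=aic.get)
print(aic, best_k)
```

The score balances goodness of fit against the number of parameters, so a single Gaussian is penalized here for fitting two clusters poorly, while larger k is penalized for the extra parameters.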

Figure 5.17 shows the results of the GMM fitting procedure along with the
original data of cats and dogs. The Gaussians produced from the fitting proce- 
dure are also illustrated. The fitgmdist command can also be used with cluster 
to label new data from the feature separation discovered by GMM. 


5.6 Supervised Learning and Linear Discriminants 


We now turn our attention to supervised learning methods. One of the earliest 
supervised methods for classification of data was developed by Fisher in 1936 
in the context of taxonomy [245]. His linear discriminant analysis (LDA) is still 
one of the standard techniques for classification. It was generalized by C. R. Rao 
for multi-class data in 1948 [586]. The goal of these algorithms is to find a linear 
combination of features that characterizes or separates two or more classes of 
objects or events in the data. Importantly, for this supervised technique we have 
labeled data that guides the classification algorithm. Figure 5.18 illustrates the
concept of finding an optimal low-dimensional embedding of the data for clas- 
sification. The LDA algorithm aims to solve an optimization problem to find 
a subspace whereby the different labeled data have clear separation between 
their distributions of points. This then makes classification easier because an 
optimal feature space has been selected. 

The supervised learning architecture includes a training set and a withhold 
set of data. The withhold set is never used to train the classifier. However, the 
training data can be partitioned into k folds, for instance, to help build a better
classification model. The last chapter details how cross-validation should

Figure 5.18: Illustration of linear discriminant analysis (LDA). The LDA optimization
method produces an optimal dimensionality reduction to a decision
line for classification. The figure illustrates the projection of data onto the second
and fourth principal component modes of the dogs and cats wavelet data
considered earlier in this chapter. Without optimization, a general projection can lead to
very poor discrimination between the data. However, the LDA separates the 
probability density functions in an optimal way. 


be appropriately used. The goal here is to train an algorithm that uses feature 
space to make a decision about how to classify data. Figure 5.18 gives a cartoon
of the key idea involved in LDA. In our example, two data sets are considered 
and projected onto new bases. On the left-hand side, the projection shows that 
the data is completely mixed, making it difficult to separate the data. On the 
right-hand side, which is the ideal caricature for LDA, the data are well separated,
with the means $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ being well apart when projected onto the
chosen subspace. Thus the goal of LDA is two-fold: find a suitable projection that
maximizes the distance between the inter-class data while minimizing the spread of the
intra-class data.

For a two-class LDA, this results in the following mathematical formulation. 
Construct a projection w such that 


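$$\mathbf{w} = \underset{\mathbf{w}}{\arg\max} \, \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}, \tag{5.22}$$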


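where the scatter matrices for between-class $\mathbf{S}_B$ and within-class $\mathbf{S}_W$ data are
given by

$$\mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T, \tag{5.23}$$
$$\mathbf{S}_W = \sum_{j=1}^{2} \sum_{\mathbf{x} \in \mathcal{D}_j} (\mathbf{x} - \boldsymbol{\mu}_j)(\mathbf{x} - \boldsymbol{\mu}_j)^T. \tag{5.24}$$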


These quantities essentially measure the variance of the data sets as well as
the variance of the difference in the means. The criterion in (5.22) is commonly
known as the generalized Rayleigh quotient, whose solution can be found via
the generalized eigenvalue problem

$$\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}, \tag{5.25}$$

where the maximum eigenvalue $\lambda$ and its associated eigenvector give the quantity
of interest and the projection basis. Thus, once the scatter matrices are constructed,
the generalized eigenvectors can be constructed with MATLAB.
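
A minimal NumPy sketch of this construction, on assumed synthetic two-class data, is given below. It forms the scatter matrices, solves the generalized eigenvalue problem by rewriting it as an ordinary eigenvalue problem for $\mathbf{S}_W^{-1} \mathbf{S}_B$ (one option among several, valid when $\mathbf{S}_W$ is invertible), and compares the result with the two-class closed form, which is proportional to $\mathbf{S}_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumed): two classes with unequal spread per feature
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
SB = np.outer(mu2 - mu1, mu2 - mu1)                          # between-class scatter
SW = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)   # within-class scatter

# S_B w = lambda S_W w, rewritten as (S_W^{-1} S_B) w = lambda w
evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))
w = np.real(evecs[:, np.argmax(np.real(evals))])
w = w / np.linalg.norm(w)

# Two-class closed form: w is proportional to S_W^{-1} (mu2 - mu1)
w_direct = np.linalg.solve(SW, mu2 - mu1)
w_direct = w_direct / np.linalg.norm(w_direct)
print(w, w_direct)
```

The two directions agree up to sign, since an eigenvector is only defined up to a scalar multiple.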
Performing an LDA analysis in MATLAB is simple. One needs only to or- 
ganize the data into a training set with labels, which can then be applied to a 


test data set. Given a set of data $\mathbf{x}_j$ for $j = 1, 2, \ldots, m$ with corresponding labels
$y_j$, the algorithm will find an optimal classification space as shown in Fig. 5.18.
New data $\mathbf{x}_k$ with $k = m+1, m+2, \ldots, m+n$ can then be evaluated and
labeled. We illustrate the classification of data using the dogs and cats data set
introduced in the feature section of this chapter. Specifically, we consider the
dogs and cats images in the wavelet domain and label them so that $y_j \in \{\pm 1\}$
(where $y_j = 1$ is a dog and $y_j = -1$ is a cat). The following code trains on the
first 60 images of dogs and cats, and then tests the classifier on the remaining 20 
dogs and cats images. For simplicity, we train on the second and fourth prin- 
cipal components, as these show good discrimination between dogs and cats 


(see Fig. 5.5).
Code 5.9: [MATLAB] LDA analysis of dogs versus cats. 


class=classify(test,xtrain,label);


Code 5.9: [Python] LDA analysis of dogs versus cats. 


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
test_class = lda.fit(xtrain, label).predict(test)


Note that the classify command in MATLAB takes in the three matrices of 
interest: the training data, the test data, and the labels for the training data. 
What is produced are the labels for the test set. One can also extract from this 
command the decision line for online use. Figure 5.19 shows the results of the
classification on the 40 test data samples. Recall that this classification is performed
using only the second and fourth PCA modes, which cluster as shown

Figure 5.19: Depiction of the performance achieved for classification using the
second and fourth principal component modes. The top two panels are PCA
modes (features) used to build a classifier. The labels returned are $y_j \in \{\pm 1\}$.
The ground-truth answer in this case should produce a vector of 20 ones
followed by 20 negative ones. 


in Fig. 5.5. The returned labels are ±1 depending on whether a cat or
a dog is labeled. The ground-truth labels for the test data should return a +1
(dogs) for the first 20 test samples and a -1 (cats) for the second 20. The accuracy
of classification for this realization is 82.5% (2/20 cats are mislabeled while
5/20 dogs are mislabeled). Comparing the wavelet images to the raw images, 
we see that the feature selection in the raw images is not as good. In particu- 
lar, for the same two principal components, 9/20 cats are mislabeled and 4/20 
dogs are mislabeled. Of course, the data is fairly limited and cross-validation 
should always be performed to evaluate the classifier. We run 100 trials of the 
classify command where 60 dogs and cats images are randomly selected and 
tested against the remaining 20 images. 

Figure shows the results of the cross-validation over 100 trials. Note 
the variability that can occur from trial to trial. Specifically, the performance can 
achieve 100%, but can also be as low as 40%, which is worse than a coin flip. The 
average classification score (red dotted line) is around 70%. Cross-validation, as 
already highlighted in the regression chapter, is critical for testing and robus- 
tifying the model. Recall that the methods for producing a classifier are based 


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




[Plot: classification accuracy (vertical axis, up to 100) versus trial number (horizontal axis, 0 to 100), with the average marked.]


Figure 5.20: Performance of the LDA over 100 trials. Note the variability that 
can occur in the classifier depending on which data is selected for training and 
testing. This highlights the importance of cross-validation for building a robust 
classifier. 


on optimization and regression, so that all the cross-validation methods can be 
ported to the clustering and classification problem. 
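
The repeated random splitting described above can be sketched in a few lines. The following is a minimal illustration rather than the code used for Fig. 5.20: it assumes synthetic two-dimensional Gaussian data as a stand-in for the PCA features, a hand-written two-class LDA with equal priors (the hypothetical helper `fit_lda_2d`), and unstratified 120/40 splits.

```python
import random

def fit_lda_2d(X, y):
    """Two-class LDA in 2D with equal priors: w.x > c predicts +1."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    mu_p = [sum(x[i] for x in pos) / len(pos) for i in range(2)]
    mu_n = [sum(x[i] for x in neg) / len(neg) for i in range(2)]
    # Pooled within-class scatter matrix (2 x 2).
    S = [[0.0, 0.0], [0.0, 0.0]]
    for group, mu in ((pos, mu_p), (neg, mu_n)):
        for x in group:
            d = [x[0] - mu[0], x[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    S[i][j] += d[i] * d[j]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    diff = [mu_p[0] - mu_n[0], mu_p[1] - mu_n[1]]
    # w = S^{-1} (mu_p - mu_n), using the closed-form 2 x 2 inverse.
    w = [(S[1][1] * diff[0] - S[0][1] * diff[1]) / det,
         (-S[1][0] * diff[0] + S[0][0] * diff[1]) / det]
    mid = [(mu_p[i] + mu_n[i]) / 2 for i in range(2)]
    c = w[0] * mid[0] + w[1] * mid[1]  # threshold at the midpoint of the means
    return w, c

def predict(w, c, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > c else -1

rng = random.Random(0)
# Synthetic, overlapping two-class data (an assumption, not the dog/cat features).
X = [[rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(80)] + \
    [[rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(80)]
y = [1] * 80 + [-1] * 80

accuracies = []
for trial in range(100):
    idx = list(range(160))
    rng.shuffle(idx)                      # new random train/test split each trial
    train, test = idx[:120], idx[120:]
    w, c = fit_lda_2d([X[i] for i in train], [y[i] for i in train])
    hits = sum(predict(w, c, X[i]) == y[i] for i in test)
    accuracies.append(hits / len(test))

print(min(accuracies), sum(accuracies) / len(accuracies), max(accuracies))
```

The spread between the smallest and largest entries of `accuracies` is the trial-to-trial variability discussed above; the mean is the cross-validated score.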

In addition to a linear discriminant line, a quadratic discriminant line can be 
found to separate the data. Indeed, the classify command in MATLAB allows 
one to not only produce the classifier, but also extract the line of separation 
between the data. 

Figure 5.21 shows the dogs and cats data along with the linear and quadratic
lines separating them. This linear or quadratic fit is found in the structured
variable coeff which is returned with classify. The quadratic line of separation can
often offer a little more flexibility when trying to fit boundaries separating data.
A major advantage of LDA-based methods is that they are easily interpretable
and easy to compute. Thus, they are widely used across many branches of the
sciences for classification of data.


5.7 Support Vector Machines (SVM) 


One of the most successful data-mining methods developed to date is the sup- 
port vector machine (SVM). It is a core machine learning tool that is used widely 
in industry and science, often providing results that are better than competing 
methods. Along with the random forest algorithm, they have been pillars of ma- 
Figure 5.21: Classification line for (a) linear discriminant analysis (LDA) and 
(b) quadratic discriminant analysis (QDA) for dog (green dots) versus cat (ma- 
genta dots) data projected onto the second and fourth principal components. 
This two-dimensional feature space allows for a good discrimination in the 
data. The two lines represent the best line and parabola for separating the data 
for a given training sample. 


chine learning in the last few decades. With enough training data, the SVM can 
now be replaced with deep neural nets. But otherwise, SVM and random for- 
est are frequently used algorithms for applications where the best classification 
scores are required. 

The original SVM algorithm by Vapnik and Chervonenkis evolved out of 
the statistical learning literature in 1963, where hyperplanes are optimized to 
split the data into distinct clusters. Nearly three decades later, Boser, Guyon and 
Vapnik created nonlinear classifiers by applying the kernel trick to maximum- 
margin hyperplanes [98]. The current standard incarnation (soft margin) was 
proposed by Cortes and Vapnik in the mid-1990s [184]. 


Linear SVM 
The key idea of the linear SVM method is to construct a hyperplane 


$$\mathbf{w} \cdot \mathbf{x} + b = 0, \qquad (5.26)$$


where the vector $\mathbf{w}$ and constant $b$ parameterize the hyperplane. Figure 5.22
shows two potential hyperplanes splitting a set of data. Each has a different
value of $\mathbf{w}$ and constant $b$. The optimization problem associated with SVM is
not only to optimize a decision line that makes the fewest labeling errors for
the data, but also to optimize the largest margin between the data, shown in
the shaded regions of Fig. The vectors that determine the boundaries of
the margin, i.e., the vectors touching the edge of the shaded regions, are termed
the support vectors. Given the hyperplane (5.26), a new data point $\mathbf{x}_j$ can be
classified by simply computing the sign of $(\mathbf{w} \cdot \mathbf{x}_j + b)$. Specifically, for
classification labels $y_j \in \{\pm 1\}$, the data to the left or right of the hyperplane
is given by


$$y_j(\mathbf{w} \cdot \mathbf{x}_j + b) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}_j + b) = \begin{cases} +1 & \text{magenta ball}, \\ -1 & \text{green ball}. \end{cases} \qquad (5.27)$$


Thus the classifier $y_j$ is explicitly dependent on the position of $\mathbf{x}_j$.


Figure 5.22: The SVM classification scheme constructs a hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$
that optimally separates the labeled data. The area of the margin separating the
labeled data is maximal in (a) and much less in (b). Determining the vector $\mathbf{w}$
and parameter $b$ is the goal of the SVM optimization. Note that for data to the
right of the hyperplane $\mathbf{w} \cdot \mathbf{x} + b > 0$, while for data to the left $\mathbf{w} \cdot \mathbf{x} + b < 0$.
Thus the classification labels $y_j \in \{\pm 1\}$ for the data to the left or right of the
hyperplane are given by $y_j(\mathbf{w} \cdot \mathbf{x}_j + b) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}_j + b)$. So only the sign of
$\mathbf{w} \cdot \mathbf{x} + b$ needs to be determined in order to label the data. The vectors touching
the edge of the shaded regions are termed the support vectors.

Critical to the success of the SVM is determining w and b in a principled 
way. As with all machine learning methods, an appropriate optimization must 
be formulated. The optimization is aimed at both minimizing the number of 
misclassified data points as well as creating the largest margin possible. To con- 
struct the optimization objective function, we define a loss function 


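$$\ell(y_j, \bar{y}_j) = \ell(y_j, \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}_j + b)) = \begin{cases} 0 & \text{if } y_j = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}_j + b), \\ +1 & \text{if } y_j \neq \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}_j + b). \end{cases} \qquad (5.28)$$

Stated more simply,

$$\ell(y_j, \bar{y}_j) = \begin{cases} 0 & \text{if data is correctly labeled}, \\ +1 & \text{if data is incorrectly labeled}. \end{cases} \qquad (5.29)$$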




Thus each mislabeled point produces a loss of unity. The training error over $m$
data points is the sum of the loss functions $\ell(y_j, \bar{y}_j)$.

In addition to minimizing the loss function, the goal is also to make the 
margin as large as possible. We can then frame the linear SVM optimization 
problem as 


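$$\underset{\mathbf{w}, b}{\operatorname{argmin}} \sum_{j=1}^{m} \ell(y_j, \bar{y}_j) + \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad \min_j |\mathbf{x}_j \cdot \mathbf{w}| = 1. \qquad (5.30)$$
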
Although this is a concise statement of the optimization problem, the fact that 
the loss function is discrete and constructed from ones and zeros makes it very 
difficult to actually optimize. Most optimization algorithms are based on some 
form of gradient descent, which requires smooth objective functions in order 
to compute derivatives or gradients to update the solution. A more common 
formulation then is given by 


$$\underset{\mathbf{w}, b}{\operatorname{argmin}} \sum_{j=1}^{m} H(y_j, \bar{y}_j) + \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad \min_j |\mathbf{x}_j \cdot \mathbf{w}| = 1, \qquad (5.31)$$

where $H(y_j, \bar{y}_j) = \max(0, 1 - y_j \cdot \bar{y}_j)$ is called a hinge loss function. This is
a continuous, piecewise-linear function that penalizes errors in a linear way and
that allows for piecewise differentiation so that standard optimization routines can
be employed.
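
To make the role of the hinge loss concrete, the following is a minimal sketch of a linear SVM trained by subgradient descent. It is an illustration under stated assumptions, not the solver used by MATLAB or scikit-learn: it follows the common convention of applying the hinge loss to the raw score $\mathbf{w} \cdot \mathbf{x}_j + b$ rather than to its sign, averages the loss over the data with a small (hypothetical) weight `lam` on $\|\mathbf{w}\|^2$, and does not enforce the constraint in (5.31). The function names and the toy data are our own.

```python
def hinge(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {+1, -1}."""
    return max(0.0, 1.0 - y * score)

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Subgradient descent on lam/2 * |w|^2 + mean hinge loss."""
    n, m = len(X[0]), len(X)
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0   # gradient of the regularizer
        for x, t in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * score < 1.0:                # inside the margin or misclassified
                for i in range(n):
                    gw[i] -= t * x[i] / m
                gb -= t / m
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy, linearly separable data (an assumption for illustration).
X = [[2.0, 2.0], [3.0, 1.5], [2.5, 3.0], [-2.0, -1.0], [-3.0, -2.5], [-1.5, -2.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
labels = [1 if s > 0 else -1 for s in scores]
total_hinge = sum(hinge(t, s) for t, s in zip(y, scores))
print(w, b, labels, total_hinge)
```

Only points with $y_j(\mathbf{w} \cdot \mathbf{x}_j + b) < 1$ contribute to the update, which is why the piecewise-linear hinge loss can be optimized where the 0/1 loss of (5.30) cannot.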


Nonlinear SVM 


Although easily interpretable, linear classifiers are of limited value. They are 
simply too restrictive for data embedded in a high-dimensional space and which 
may have the structured separation as illustrated in Fig. To build more so- 
phisticated classification curves, the feature space for SVM must be enriched. 
SVM does this by including nonlinear features and then building hyperplanes 
in this new space. To do this, one simply maps the data into a nonlinear, higher- 
dimensional space 

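$$\mathbf{x} \mapsto \boldsymbol{\Phi}(\mathbf{x}). \qquad (5.32)$$


We can call the $\boldsymbol{\Phi}(\mathbf{x})$ new observables of the data. The SVM algorithm now learns
the hyperplanes that optimally split the data into distinct clusters in a new
space. Thus one now considers the hyperplane function


$$f(\mathbf{x}) = \mathbf{w} \cdot \boldsymbol{\Phi}(\mathbf{x}) + b, \qquad (5.33)$$


with corresponding labels $y_j \in \{\pm 1\}$ for each point $f(\mathbf{x}_j)$.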
This simple idea, of enriching feature space by defining new functions of the 
data x, is exceptionally powerful for clustering and classification. As a simple 
Figure 5.23: The nonlinear embedding of Fig. 5.8(b) using the variables
$(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1, x_2, x_1^2 + x_2^2)$ in (5.34). A hyperplane can now easily
separate the green from magenta balls, showing that linear classification can be
accomplished simply by enriching the measurement space of the data. Visual
inspection alone suggests that nearly optimal separation can be achieved with
the plane $z_3 \approx 14$ (shaded gray plane). In the original coordinate system, this
gives a circular classification line (black line on the plane $x_1$ versus $x_2$) with
radius $r = \sqrt{z_3} = \sqrt{x_1^2 + x_2^2} \approx \sqrt{14}$. This example makes it obvious how a
hyperplane in higher dimensions can produce curved classification lines in the
original data space.


example, consider two-dimensional data x = (x1, x2). One can easily enrich the 
space by considering polynomials of the data: 


$$(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1, x_2, x_1^2 + x_2^2). \qquad (5.34)$$


This gives a new set of polynomial coordinates in $x_1$ and $x_2$ that can be used to
embed the data. This philosophy is simple: by embedding the data in a
higher-dimensional space, it is much more likely to be separable by hyperplanes. As
a simple example, consider the data illustrated in Fig. 5.8(b). A linear classifier
(or hyperplane) in the $x_1$-$x_2$ plane will clearly not be able to separate the data.
However, the embedding projects into a three-dimensional space, which
can be easily separated by a hyperplane as illustrated in Fig.

The ability of SVM to embed in higher-dimensional nonlinear spaces makes it one
of the most successful machine learning algorithms developed. The underlying
optimization algorithm remains unchanged, except that the previous labeling
function $y_j = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}_j + b)$ is now


$$y_j = \operatorname{sign}(\mathbf{w} \cdot \boldsymbol{\Phi}(\mathbf{x}_j) + b). \qquad (5.35)$$


The function $\boldsymbol{\Phi}(\mathbf{x})$ specifies the enriched space of observables. As a general
rule, more features are better for classification.
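
The embedding (5.34) is easy to try numerically. The sketch below assumes two synthetic concentric rings (radii chosen by us for illustration, not the data of Fig. 5.8(b)) and checks that the single lifted coordinate $z_3 = x_1^2 + x_2^2$ separates them with a plane, here placed at $z_3 = 14$ to echo Fig. 5.23.

```python
import math
import random

rng = random.Random(1)

def ring(radius, count):
    """Points scattered within +/- 0.5 of a circle of the given radius."""
    pts = []
    for _ in range(count):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.uniform(-0.5, 0.5)
        pts.append((r * math.cos(angle), r * math.sin(angle)))
    return pts

inner = ring(1.5, 50)   # one class: not linearly separable from the other in (x1, x2)
outer = ring(5.5, 50)   # other class surrounds the first

def lift(p):
    """The embedding (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    x1, x2 = p
    return (x1, x2, x1 ** 2 + x2 ** 2)

z3_inner = [lift(p)[2] for p in inner]
z3_outer = [lift(p)[2] for p in outer]

threshold = 14.0        # the separating plane z3 = 14
labels_inner = [1 if z3 < threshold else -1 for z3 in z3_inner]
labels_outer = [1 if z3 < threshold else -1 for z3 in z3_outer]
print(max(z3_inner), min(z3_outer))
```

In the lifted space the classifier is the linear rule $z_3 < 14$; mapped back to the original plane it is a circle of radius $\sqrt{14}$.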


Kernel Methods for SVM 


Despite its promise, the SVM method of building nonlinear classifiers by en- 
riching in higher dimensions leads to a computationally intractable optimiza- 
tion. Specifically, the large number of additional features leads to the curse of 
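dimensionality. Thus computing the vector $\mathbf{w}$ is prohibitively expensive, and
$\mathbf{w}$ may not even be representable explicitly in memory. The kernel trick solves this
problem. In this scenario, the $\mathbf{w}$ vector is represented as


$$\mathbf{w} = \sum_{j=1}^{m} \alpha_j \boldsymbol{\Phi}(\mathbf{x}_j), \qquad (5.36)$$


where $\alpha_j$ are parameters that weight the different nonlinear observable
functions $\boldsymbol{\Phi}(\mathbf{x}_j)$. Thus the vector $\mathbf{w}$ is expanded in the observable set of functions.
We can then generalize (5.33) to the following:


$$f(\mathbf{x}) = \sum_{j=1}^{m} \alpha_j \boldsymbol{\Phi}(\mathbf{x}_j) \cdot \boldsymbol{\Phi}(\mathbf{x}) + b. \qquad (5.37)$$


The kernel function [643] is then defined as

$$K(\mathbf{x}_j, \mathbf{x}) = \boldsymbol{\Phi}(\mathbf{x}_j) \cdot \boldsymbol{\Phi}(\mathbf{x}). \qquad (5.38)$$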


With this new definition of $\mathbf{w}$, the optimization problem (5.31) becomes


$$\underset{\boldsymbol{\alpha}, b}{\operatorname{argmin}} \sum_{j=1}^{m} H(y_j, \bar{y}_j) + \frac{1}{2} \Big\| \sum_{j=1}^{m} \alpha_j \boldsymbol{\Phi}(\mathbf{x}_j) \Big\|^2 \quad \text{subject to} \quad \min_j |\mathbf{x}_j \cdot \mathbf{w}| = 1, \qquad (5.39)$$


where $\boldsymbol{\alpha}$ is the vector of $\alpha_j$ coefficients that must be determined in the
minimization process. There are different conventions for representing the
minimization. However, in this formulation, the minimization is now over $\boldsymbol{\alpha}$ instead
of $\mathbf{w}$.

In this formulation, the kernel function $K(\mathbf{x}_j, \mathbf{x})$ essentially allows us to
represent Taylor series expansions of a large (infinite) number of observables in
a compact way [643]. The kernel function enables one to operate in a
high-dimensional, implicit feature space without ever computing the coordinates
of the data in that space, but rather by simply computing the inner products
between all pairs of data in the feature space. For instance, two of the most
commonly used kernel functions are


radial basis function (RBF): $K(\mathbf{x}_j, \mathbf{x}) = \exp(-\gamma \|\mathbf{x}_j - \mathbf{x}\|^2)$, (5.40a)
polynomial kernel: $K(\mathbf{x}_j, \mathbf{x}) = (\mathbf{x}_j \cdot \mathbf{x} + 1)^N$, (5.40b)


where $N$ is the degree of polynomials to be considered, which produces a feature
space that is exceptionally large to evaluate without using the kernel trick, and
$\gamma$ sets the width of the Gaussian kernel measuring the distance between individual
data points $\mathbf{x}_j$ and the classification line. These functions can be differentiated
in order to optimize (5.39).
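
A small numerical check helps show what the kernel trick buys. For two-dimensional data and $N = 2$, the polynomial kernel (5.40b) equals an inner product in a six-dimensional feature space; the explicit map `phi` below is one standard choice that reproduces it (our own illustration, with hypothetical function names and an arbitrary pair of test points).

```python
import math

def poly_kernel(x, z, degree=2):
    """Polynomial kernel (x . z + 1)^N evaluated in the original 2D space."""
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** degree

def phi(x):
    """Explicit degree-2 feature map for 2D x whose inner product matches poly_kernel."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, x1 * x1, s * x1 * x2, x2 * x2]

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel exp(-gamma * |x - z|^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # inner product in feature space
implicit = poly_kernel(x, z)                           # same number, no feature map needed
print(explicit, implicit, rbf_kernel(x, z))
```

The two values agree, but the kernel evaluation never forms the six features. For larger $N$ and higher-dimensional data the explicit feature vector grows combinatorially, while the kernel remains a single dot product followed by a power.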

This represents the major theoretical underpinning of the SVM method. It 
allows us to construct higher-dimensional spaces using observables generated 
by kernel functions. Moreover, it results in a computationally tractable opti- 
mization. The following code shows the basic workings of the kernel method 
on the example of dogs and cats classification data. In the first example, a stan- 
dard linear SVM is used, while in the second, the RBF is executed as an option. 


Code 5.10: [MATLAB] SVM classification. 


Mdl = fitcsvm(xtrain, label);
test_labels = predict(Mdl, test);

Mdl = fitcsvm(xtrain, label, 'KernelFunction', 'RBF');
test_labels = predict(Mdl, test);

CMdl = crossval(Mdl);        % cross-validate the model
classLoss = kfoldLoss(CMdl)  % compute class loss


Code 5.10: [Python] SVM classification. 


import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

Mdl = svm.SVC(kernel='rbf', gamma='auto').fit(xtrain, label)
test_labels = Mdl.predict(test)

CMdl = cross_val_score(Mdl, xtrain, label, cv=10)  # cross-validate the model
classLoss = 1 - np.mean(CMdl)  # average error over all cross-validation iterations


Note that in this code we have demonstrated some of the diagnostic features 
of the SVM method in MATLAB, including the cross-validation and class loss 
scores that are associated with training. This is a superficial treatment of the 
SVM. Overall, SVM is one of the most sophisticated machine learning tools in 
MATLAB, and there are many options that can be executed in order to tune 
performance and extract accuracy/cross-validation metrics.


5.8 Classification Trees and Random Forest


Decision trees are common in business. They establish an algorithmic flow 
chart for making decisions based on criteria that are deemed important and re- 
lated to a desired outcome. Often the decision trees are constructed by experts 
with knowledge of the workflow involved in the decision making process. De- 
cision tree learning provides a principled method based on data for creating a 
predictive model for classification and/or regression. Along with SVM, clas- 
sification and regression trees are core machine learning and data-mining al- 
gorithms used in industry, given their demonstrated success. The work of Leo 
Breiman and co-workers established many of the theoretical foundations 
exploited today for data mining. 

The decision tree is a hierarchical construct that looks for optimal ways to 
split the data in order to provide a robust classification and regression. It is the 
opposite of the unsupervised dendrogram hierarchical clustering previously 
demonstrated. In this case, our goal is not to move from bottom up in the clus- 
tering process, but from top down in order to create the best splits possible 
for classification. The fact that it is a supervised algorithm, which uses labeled 
data, allows us to split the data accordingly. 

There are significant advantages in developing decision trees for classifi- 
cation and regression: (i) they often produce interpretable results that can be 
graphically displayed, making them easy to interpret even for non-experts; 
(ii) they can handle numerical or categorical data equally well; (iii) they can 
be statistically validated so that the reliability of the model can be assessed; 
(iv) they perform well with large data sets at scale; and (v) the algorithms mir- 
ror human decision making, again making them more interpretable and useful. 

As one might expect, the success of decision tree learning has produced a 
large number of innovations and algorithms for how to best split the data. The 
coverage here will be limited, but we will highlight the basic architecture for 
data splitting and tree construction. Recall that we have the following: 


data $\{\mathbf{x}_j \in \mathbb{R}^n,\ j \in Z := \{1, 2, \ldots, m\}\}$, (5.41a)
labels $\{y_j \in \{\pm 1\},\ j \in Z' \subset Z\}$. (5.41b)


The basic decision tree algorithm is fairly simple: (i) Scan through each
component (feature) $x_k$ (with $k = 1, 2, \ldots, n$) of the vector $\mathbf{x}_j$ to identify the value of $x_k$
that gives the best labeling prediction for $y_j$. (ii) Compare the prediction
accuracy for each split on the feature $x_k$. The feature giving the best segmentation
of the data is selected as the split for the tree. (iii) With the two new branches
of the tree created, this process is repeated on each branch. The algorithm
terminates once each individual data point is a unique cluster, known as a leaf, on
a new branch of the tree. This is essentially the inverse of the dendrogram.
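
Steps (i) and (ii) amount to an exhaustive scan over features and thresholds. The following is a minimal sketch of that scan for a single split, assuming labels in $\{\pm 1\}$, candidate thresholds at midpoints between consecutive observed values, and plain classification accuracy as the score (production tree learners typically use impurity measures instead). The helper `best_split` and the toy data are hypothetical.

```python
def best_split(X, y):
    """Return (feature index, threshold, accuracy) of the best single split."""
    best = (None, None, 0.0)
    m = len(X)
    for k in range(len(X[0])):                      # (i) scan each feature
        values = sorted(set(x[k] for x in X))
        for lo, hi in zip(values, values[1:]):      # midpoints between observed values
            thr = (lo + hi) / 2.0
            left = [t for x, t in zip(X, y) if x[k] < thr]
            right = [t for x, t in zip(X, y) if x[k] >= thr]
            # Each side of the split predicts its own majority label.
            correct = sum(max(side.count(1), side.count(-1)) for side in (left, right))
            if correct / m > best[2]:               # (ii) keep the most accurate split
                best = (k, thr, correct / m)
    return best

# Toy data: the second feature separates the classes, the first does not.
X = [[5.1, 1.4], [4.9, 1.3], [6.3, 4.7], [5.9, 1.5], [5.8, 5.1], [5.0, 4.5]]
y = [1, 1, -1, 1, -1, -1]

feature, threshold, accuracy = best_split(X, y)
print(feature, threshold, accuracy)
```

Step (iii) would apply `best_split` recursively to the data on each side of the chosen threshold until the leaves are pure or a stopping rule is met.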
Figure 5.24: Illustration of the splitting procedure for decision tree learning
performed on the Fisher iris data set. Each variable $x_1$ through $x_4$ is scanned over
to determine the best split of data which retains the best correct classification
of the labeled data in the split. The variable $x_3 = 2.35$ provides the first split in
the data for building a classification tree. This is followed by a second split at
$x_4 = 1.75$ and a third split at $x_3 = 4.95$. Only three splits are shown. The
classification tree after three splits is shown in Fig. Note that, although the setosa
data in the $x_1$ and $x_2$ direction seems to be well separated along a diagonal line,
the decision tree can only split along horizontal and vertical lines.


As a specific example, consider the Fisher iris data set from Fig. For 
this data, each flower had four features (petal width and length, sepal width 
and length), and three labels (setosa, versicolor, and virginica). There were 50 
flowers of each variety for a total of 150 data points. Thus for this data the
vector $\mathbf{x}_j$ has the four components


$x_1$ = sepal length, (5.42a)
$x_2$ = sepal width, (5.42b)
$x_3$ = petal length, (5.42c)
$x_4$ = petal width. (5.42d)


The decision tree algorithm scans over these four features in order to decide
how to best split the data. Figure 5.24 shows the splitting process in the space
of the four variables $x_1$ through $x_4$. Illustrated are two data planes containing
$x_1$ versus $x_2$ (panel (b)) and $x_3$ versus $x_4$ (panel (a)). By visual inspection, one
can see that the $x_3$ (petal length) variable maximally separates the data. In fact,
the decision tree performs the first split of the data at $x_3 = 2.35$. No further
splitting is required to predict setosa, as this first split is sufficient. The variable
$x_4$ then provides the next most promising split at $x_4 = 1.75$. Finally, a third
split is performed at $x_3 = 4.95$. Only three splits are shown. This process shows
that the splitting procedure has an intuitive appeal, as the data splits optimally




Figure 5.25: Tree structure generated by the MATLAB fitctree command. Note 
that only three splits are conducted, creating a classification tree that produces 
a class error of 4.67%. 


separating the data are clearly visible. Moreover, the splitting does not occur
on the $x_1$ and $x_2$ (sepal length and width) variables as they do not provide a
clear separation of the data. Figure 5.25 shows the tree used for Fig.

The following code fits a tree to the Fisher iris data. Note that the fitctree 
command allows for many options, including a cross-validation procedure (used 
in the code) and parameter tuning (not used in the code). 


Code 5.11: [MATLAB] Decision tree classification of Fisher iris data. 


load fisheriris;
tree = fitctree(meas, species, 'MaxNumSplits', 3, 'CrossVal', 'on')
view(tree.Trained{1}, 'Mode', 'graph');
classError = kfoldLoss(tree)


Code 5.11: [Python] Decision tree classification of Fisher iris data. 


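from io import StringIO
from sklearn import tree
import pydotplus
from IPython.display import Image

# meas (features) and species_label (labels) are assumed to be loaded beforehand
decision_tree = tree.DecisionTreeClassifier(max_depth=3).fit(meas, species_label)
dot_data = StringIO()  # in-memory buffer that receives the Graphviz description
tree.export_graphviz(decision_tree, out_file=dot_data,
                     filled=True, rounded=True,
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())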


The results of the splitting procedure are demonstrated in Fig. The view 
command generates an interactive window showing the tree structure. The tree 
[Tree diagram: first split on x2 at 0.0490372; second split on x4 at 0.0199106.]


Figure 5.26: Tree structure generated by the MATLAB fitctree command for 
dog versus cat data. Note that only two splits are conducted, creating a classi- 
fication tree that produces a class error of approximately 16%. 


can be pruned and other diagnostics are shown in this interactive graphic for- 
mat. The class error achieved for the Fisher iris data is 4.67%. 

As a second example, we construct a decision tree to classify dogs versus
cats using our previously considered wavelet images. Figure shows
the resulting classification tree. Note that the decision tree learning algorithm
identifies the first two splits as occurring along the $x_2$ and $x_4$ variables,
respectively. These two variables have been considered previously since their
histograms show them to be more distinguishable than the other PCA components
(see Fig. 5.5). For this splitting, which has been cross-validated, the class error
achieved is approximately 16%, which can be compared with the 30% error of
LDA.

As a final example, we consider census data that is included in MATLAB. 
The following code shows some important uses of the classification and regres- 
sion tree architecture. In particular, the variables included can be used to make 
associations between relationships. In this case, the various data is used to pre- 
dict the salary data. Thus, salary is the outcome of the classification. Moreover, 
the importance of each variable and its relation to salary can be computed, as 
shown in Fig. The following code highlights some of the functionality of 
the tree architecture. 


Code 5.12: [MATLAB] Decision tree classification of census data. 


load census1994
X = adultdata(:, {'age','workClass','education_num', ...
    'marital_status','race','sex','capital_gain', ...
    'capital_loss','hours_per_week','salary'});

Mdl = fitctree(X, 'salary', 'PredictorSelection', 'curvature', ...
    'Surrogate', 'on');
imp = predictorImportance(Mdl);

bar(imp, 'FaceColor', [.6 .6 .6], 'EdgeColor', 'k');
title('Predictor Importance Estimates');
ylabel('Estimates'); xlabel('Predictors'); h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;


Figure 5.27: Importance of variables for prediction of salary data for the US
census of 1994. The classification tree architecture allows for sophisticated
treatment of data, including understanding how each variable contributes
statistically to predicting a classification outcome.


Code 5.12: [Python] Decision tree classification of census data. 


from sklearn import tree

# adultdata_input (predictors) and adultdata_salary (labels) are assumed to be prepared beforehand
Mdl = tree.DecisionTreeClassifier(max_features=10).fit(
    adultdata_input, adultdata_salary)
imp = Mdl.feature_importances_
infeatures = ['age', 'workClass', 'education_num', 'marital_status', 'race',
              'sex', 'capital_gain', 'capital_loss', 'hours_per_week']


As with the SVM algorithm, there exists a wide variety of tuning parame- 
ters for classification trees, and this is a superficial treatment. Overall, such 
trees are one of the most sophisticated machine learning tools in MATLAB and 
there are many options that can be executed to tune performance and extract 
accuracy/cross-validation metrics.


Random Forest Algorithms 


Before closing this section, it is important to mention Breiman's random forest
[108] innovations for decision learning trees. Random forests, or random
decision forests, are an ensemble learning method for classification and regression.
This is an important innovation, since the decision trees created by splitting are
generally not robust to different samples of the data. Thus one can generate
two significantly different classification trees with two subsamples of the data.
This presents significant challenges for cross-validation. In ensemble learning,
a multitude of decision trees are constructed in the training process. The
random decision forests correct for decision trees' habit of overfitting to their
training set, thus providing a more robust framework for classification.

There are many variants of the random forest architecture, including
variants with boosting and bagging. These will not be considered here except to
mention that the MATLAB fitctree exploits many of these techniques through its
options. One way to think about ensemble learning is that it allows for robust
classification trees. It often does this by focusing its training efforts on
hard-to-classify data instead of easy-to-classify data. Random forests, bagging, and
boosting are all extensive subjects in their own right, but have already been
incorporated into leading software offerings that build decision learning trees.
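
The ensemble idea can be illustrated with a deliberately small sketch. It is not Breiman's full algorithm: each "tree" here is a one-split stump, each stump is trained on a bootstrap sample with a random subset of the features, and the ensemble predicts by majority vote. All function names and the synthetic Gaussian data are assumptions made for illustration.

```python
import random

def fit_stump(X, y, features):
    """Best one-split tree over the allowed features, or None if no split exists."""
    best = None
    for k in features:
        values = sorted(set(x[k] for x in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            for sign in (1, -1):  # sign is the label predicted when x[k] >= thr
                hits = sum((sign if x[k] >= thr else -sign) == t for x, t in zip(X, y))
                if best is None or hits > best[0]:
                    best = (hits, k, thr, sign)
    return None if best is None else best[1:]

def stump_predict(stump, x):
    k, thr, sign = stump
    return sign if x[k] >= thr else -sign

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample (bagging)
        features = rng.sample(range(n_features), max(1, n_features // 2))  # random features
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], features)
        if stump is not None:
            forest.append(stump)
    return forest

def forest_predict(forest, x):
    votes = sum(stump_predict(s, x) for s in forest)  # majority vote over the ensemble
    return 1 if votes >= 0 else -1

rng = random.Random(1)
X = [[rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)] for _ in range(40)] + \
    [[rng.gauss(-1.5, 1.0), rng.gauss(-1.5, 1.0)] for _ in range(40)]
y = [1] * 40 + [-1] * 40

forest = fit_forest(X, y)
accuracy = sum(forest_predict(forest, x) == t for x, t in zip(X, y)) / len(X)
print(len(forest), accuracy)
```

Because every stump sees a different resampling of the data, the individual thresholds vary from tree to tree, while the vote averages out that variability.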


5.9 Top 10 Algorithms of Data Mining circa 2008 (Before the Deep Learning Revolution)


This chapter has illustrated the tremendous diversity of supervised and unsu- 
pervised methods available for the analysis of data. Although the algorithms 
are now easily accessible through many commercial and open-source software 
packages, the difficulty is now evaluating which method(s) should be used on 
a given problem. In December 2006, various machine learning experts attend- 
ing the IEEE International Conference on Data Mining (ICDM) identified the 
top 10 algorithms for data mining [764]. The identified algorithms were the fol- 
lowing: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, Naive 
Bayes, and CART. These top 10 algorithms were identified at the time as being
among the most influential data-mining algorithms in the research community. 
In the summary article, each algorithm was briefly described, along with its 
impact and potential future directions of research. The 10 algorithms covered 
classification, clustering, statistical learning, association analysis, and link min- 
ing, which are all among the most important topics in data-mining research 
and development. Interestingly, deep learning and neural networks, which are 
the topic of the next chapter, are not mentioned in the article. The landscape 
of data science would change significantly in 2012 with the ImageNet data set, 
and deep convolutional neural networks (CNN) began to dominate almost any 
meaningful metric for classification and regression accuracy. 

In this section, we highlight their identified top 10 algorithms and the ba- 
sic mathematical structure of each. Many of them have already been covered 
in this chapter. This list is not exhaustive, nor does it rank them beyond their 
inclusion in the top 10 list. Our objective is simply to highlight what was con- 
sidered by the community as the state-of-the-art data-mining tools in 2008. We 
begin with those algorithms already considered previously in this chapter. 


k-Means 


This is one of the workhorse unsupervised algorithms. As already demon- 
strated, the goal of k-means is simply to cluster by proximity to a set of k points. 
By updating the locations of the k points according to the mean of the points 
closest to them, the algorithm iterates to the k-means. The kmeans command 
takes in data X and the number of prescribed clusters k. It returns labels for 
each point, labels, along with their location, centers. 
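
The update described above is short enough to write out directly. The following is a minimal sketch of the iteration (alternating assignment and mean-update steps), assuming squared Euclidean distance, initial centers drawn at random from the data, and a fixed number of iterations; the function name `kmeans` mirrors the command mentioned above but is our own illustrative implementation.

```python
import random

def kmeans(X, k, iters=20, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(X, k)          # initial centers: k distinct data points
    labels = [0] * len(X)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
                  for x in X]
        # Update step: each center moves to the mean of its assigned points.
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well-separated toy clusters (an assumption for illustration).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(X, 2)
print(labels, centers)
```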


EM (Mixture Models) 


Mixture models are the second workhorse algorithm for unsupervised learn- 
ing. The assumption underlying the mixture models is that the observed data is 
produced by a mixture of different probability density functions whose weight- 
ings are unknown. Moreover, the parameters must be estimated, thus requiring 
the expectation-maximization (EM) algorithm, where fitting produces Gaus- 
sian mixtures to the data X in k clusters. The Model output is a structured 
variable containing information on the probability distributions (mean, vari- 
ance, etc.) along with the goodness-of-fit. 


Support Vector Machine (SVM) 


One of the most powerful and flexible supervised learning algorithms used for 
most of the 1990s and 2000s, the SVM is an exceptional off-the-shelf method for 
classification and regression. The main idea is to project the data into higher
dimensions and split the data with hyperplanes. Critical to making this work 
in practice was the kernel trick for efficiently evaluating inner products of func- 
tions in higher-dimensional space, where the algorithm takes in labeled train- 
ing data denoted by train and label, and produces a structured output, Model. 
The structured output can be used along with the predict command to take 
test data, test, and produce labels (test_labels). There exist many options and 
tuning parameters for fitcsvm, making it one of the best off-the-shelf methods. 


CART (Classification and Regression Tree) 


This was the subject of the last section and was demonstrated to provide an- 
other powerful technique of supervised learning. The underlying idea was to 
split the data in a principled and informed way so as to produce an inter- 
pretable clustering of the data. The data splitting occurs along a single variable 
at a time to produce branches of the tree structure, where the algorithm takes 
in labeled training data denoted by train and label, and produces a structured 
output, tree. There are many options and tuning parameters for fitctree, mak- 
ing it one of the best off-the-shelf methods. 


k-Nearest Neighbors (kNN)


This is perhaps the simplest supervised algorithm to understand. It is highly
interpretable and easy to execute. Given a new data point $\mathbf{x}_k$, which does not
have a label, simply find the $k$ nearest neighbors $\mathbf{x}_j$ with labels $y_j$. The label of
the new point $\mathbf{x}_k$ is determined by a majority vote of the $k$ nearest neighbors.
Given a model for the data, the knnsearch command uses the Mdl to label the test data,
test.
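
The majority vote can be sketched directly, without a fitted model. The function `knn_predict` below is a hypothetical helper (not the MATLAB command), assuming squared Euclidean distance and a small toy data set.

```python
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    order = sorted(range(len(train)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(train[j], x)))
    votes = Counter(labels[j] for j in order[:k])   # labels of the k nearest neighbors
    return votes.most_common(1)[0][0]

train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
labels = [1, 1, 1, -1, -1, -1]

near_first = knn_predict(train, labels, [0.5, 0.5])
near_second = knn_predict(train, labels, [5.5, 5.5])
print(near_first, near_second)
```

Choosing an odd $k$ avoids ties in the two-class case.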


Naive Bayes 


The naive Bayes algorithm provides an intuitive framework for supervised 
learning. It is simple to construct and does not require any complicated pa- 
rameter estimation, similar to SVM and/or classification trees. It further gives 
highly interpretable results that are remarkably good in practice. The method is 
based upon Bayes’s theorem and the computation of conditional probabilities. 
Thus one can estimate the label of a new data point based on the prior
probability distributions of the labeled data. The fitcnb command takes
in labeled training data denoted by train and label, and produces a structured
output, Model. The structured output can be used with the predict command
to label test data, test.
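
As an illustration of the underlying computation, the sketch below implements a Gaussian naive Bayes classifier: each feature is modeled as an independent Gaussian within each class, and a new point receives the label with the largest log posterior. The Gaussian assumption, the small variance floor, the function names, and the toy data are ours; other likelihood models are possible.

```python
import math

def fit_gnb(X, y):
    """Per-class prior, feature means, and feature variances."""
    model = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - mu) ** 2 for v in col) / len(rows) + 1e-9  # small floor
                     for col, mu in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(X), means, variances)
    return model

def predict_gnb(model, x):
    """Return the class maximizing log prior + sum of per-feature Gaussian log likelihoods."""
    def log_posterior(c):
        prior, means, variances = model[c]
        ll = sum(-0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
                 for v, mu, var in zip(x, means, variances))
        return math.log(prior) + ll
    return max(model, key=log_posterior)

X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.2], [3.2, 0.0], [2.9, 0.1]]
y = [1, 1, 1, -1, -1, -1]

model = fit_gnb(X, y)
print(predict_gnb(model, [1.1, 2.0]), predict_gnb(model, [3.1, 0.1]))
```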


AdaBoost (Ensemble Learning and Boosting)


AdaBoost is an example of an ensemble learning algorithm [251]. Broadly speaking,
AdaBoost is, like a random forest, a method that takes into account an
ensemble of decision tree models. The way all boosting algorithms work is to first
consider an equal weighting for all training data $\mathbf{x}_j$. Boosting re-weights the
importance of the data according to how difficult they are to classify. Thus the 
algorithm focuses on harder-to-classify data. A family of weak learners can be 
trained to yield a strong learner by boosting the importance of hard-to-classify 
data [629]. This concept and its usefulness are based upon a seminal theoretical 
contribution by Kearns and Valiant [378]. The fitcensemble command is a gen- 
eral ensemble learner that can do many more things than AdaBoost, including 
robust boosting and gradient boosting. Gradient boosting is one of the most 
powerful techniques [252]. 
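
The re-weighting idea can be sketched in a few lines of Python. This is an illustrative, assumed implementation of the standard AdaBoost update with one-variable threshold "stumps" as weak learners; it is not the fitcensemble code, and the names best_stump, adaboost, and predict are hypothetical.

```python
import numpy as np

def stump_predict(X, j, t, s):
    """Threshold rule on feature j: s * (-1 if x_j <= t else +1)."""
    return s * np.where(X[:, j] <= t, -1, 1)

def best_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                err = np.sum(w[stump_predict(X, j, t, s) != y])
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds=5):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # equal weighting to start
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = best_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)     # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)    # vote of this weak learner
        w = w * np.exp(-alpha * y * stump_predict(X, j, t, s))
        w = w / w.sum()                          # misclassified points gain weight
        ensemble.append((alpha, j, t, s))
    return ensemble, w

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.sign(score)

# Toy data (hypothetical): one feature, one point that a single stump gets wrong.
X = np.arange(6.0).reshape(-1, 1)
y = np.array([-1, -1, 1, -1, 1, 1])
_, w1 = adaboost(X, y, rounds=1)
print(w1)
```

After one round, the single misclassified point carries half of the total weight, so the next weak learner is pushed to classify it correctly.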


C4.5 (Ensemble Learning of Decision Trees) 


This algorithm is another variant of decision tree learning developed by J. R. 
Quinlan [579] [580]. At its core, the algorithm splits the data according to an in- 
formation entropy score. In its latest versions, it supports boosting as well as 
many other well-known functionalities to improve performance. Broadly, we 
can think of this as a strong-performing version of CART. The fitcensemble 
algorithm highlighted with AdaBoost gives a generic ensemble learning archi- 
tecture that can incorporate decision trees, allowing for a C4.5-like algorithm. 
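
The entropy score behind such splits can be illustrated with a short Python sketch. It computes plain information gain for a threshold split on one variable; this is an assumed, simplified score for illustration and not Quinlan's full C4.5 criterion. The function names are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, labels, threshold):
    """Entropy reduction from splitting on x <= threshold."""
    left, right = labels[x <= threshold], labels[x > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

# Toy data (hypothetical): the split at 2 separates the classes perfectly.
x = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])
print(information_gain(x, labels, 2.0), information_gain(x, labels, 1.0))
```

A tree-building routine would evaluate this score over candidate variables and thresholds and branch on the best one.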


Apriori Algorithm 


The last two methods highlighted here tend to focus on different aspects of 
data mining. In the Apriori algorithm, the goal is to find frequent item sets 
from data. Although this may sound trivial, it is not, since data sets tend to be 
very large and can easily produce NP-hard computations because of the com- 
binatorial nature of the algorithms. The Apriori algorithm provides an efficient 
algorithm for finding frequent item sets using a candidate generation architecture
[6]. This algorithm can then be used for fast learning of association rules in
the data.
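
The candidate generation idea can be sketched as follows. This is a minimal, assumed Python implementation for illustration (the function name apriori, the absolute support count, and the toy transactions are hypothetical): size-k candidates are built only from frequent sets of size k-1, and any candidate with an infrequent subset is pruned before counting.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for item sets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support}
    frequent, k = {}, 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
    return frequent

baskets = [{"bread", "milk"},
           {"bread", "diapers", "beer"},
           {"milk", "diapers", "beer"},
           {"bread", "milk", "diapers"},
           {"bread", "milk", "beer"}]
result = apriori(baskets, min_support=3)
print(result[frozenset({"bread", "milk"})])
```

The pruning step is what keeps the combinatorial growth in check: most large candidates are discarded without ever scanning the data.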


PageRank 


The founding of Google by Sergey Brin and Larry Page revolved around the
PageRank algorithm [114]. PageRank produces a static ranking of variables,
such as web pages, by computing an offline value for each variable that does
not depend on search queries. PageRank is associated with graph theory,
as it originally interpreted a hyperlink from one page to another as a vote. From
this, and various modifications of the original algorithm, one can then compute
an importance score for each variable and provide an ordered rank list. The
number of enhancements for this algorithm is quite large. Producing accurate
orderings of variables (web pages) and their importance remains an active topic
of research.
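
The "hyperlink as a vote" picture can be sketched with a power iteration on a small link graph. The damping factor of 0.85, the function name pagerank, and the three-page graph are assumptions made for illustration; production rankings involve many refinements not shown here.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=1000):
    """adj[i, j] = 1 if page j links to page i. Returns a score per page."""
    n = adj.shape[0]
    out = adj.sum(axis=0)                        # number of outgoing links per page
    # Column-stochastic matrix; a page with no links votes uniformly for all pages.
    P = np.where(out > 0, adj / np.where(out > 0, out, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = damping * (P @ r) + (1 - damping) / n
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r_new

# Hypothetical graph: pages 0 and 1 link to page 2, and page 2 links to page 0.
adj = np.array([[0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0],
                [1.0, 1.0, 0.0]])
r = pagerank(adj)
print(np.argsort(-r))    # pages ordered from most to least important
```

Because the scores depend only on the link structure, they can be computed offline, before any search query is issued.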


Homework


Exercise 5-1. Download the MNIST data set (both training and test sets and
labels) from http://yann.lecun.com/exdb/mnist/. Perform the following
analysis:


(a) Do an SVD analysis of the digit images. You will need to reshape each 
image into a column vector, and each column of your data matrix is a 
different image. 


(b) What does the singular value spectrum look like, and how many modes 
are necessary for good image reconstruction? (That is, what is the rank r 
of the digit space?) 


(c) What is the interpretation of the $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$ matrices?


(d) On a 3D plot, project onto three selected V modes (columns) colored by 
their digit label, for example, columns 2, 3, and 5. 


Once you have performed the above and have your data projected into PCA 
space, you will build a classifier to identify individual digits in the training set. 


(e) Pick two digits. See if you can build a linear classifier (LDA) that can
reasonably identify them.


(f) Pick three digits. Try to build a linear classifier to identify these three now. 


(g) Which two digits in the data set appear to be the most difficult to sepa- 
rate? Quantify the accuracy of the separation with LDA on the test data. 


(h) Which two digits in the data set are easiest to separate? Quantify the
accuracy of the separation with LDA on the test data.


(i) SVM (support vector machines) and decision tree classifiers were the state 
of the art until about 2014. How well do these separate between all 10 
digits? 


(j) Compare the performance between LDA, SVM, and decision trees on the 
hardest and easiest pair of digits to separate (from above). 


Make sure to discuss the performance of your classifier on both the training 
and test sets. 


Exercise 5-2. Download the two data sets (ORIGINAL IMAGE and CROPPED
IMAGES) from Yale Faces B. Your job is to perform an analysis of these data
sets. Start with the cropped images and perform the following analysis.


(a) Do an SVD analysis of the images (where each image is reshaped into a
column vector and each column is a new image).


(b) What is the interpretation of the $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$ matrices?


(c) What does the singular value spectrum look like and how many modes 
are necessary for good image reconstructions? (That is, what is the rank r 
of the face space?) 


(d) Compare the difference between the cropped (and aligned) versus un- 
cropped images. 


Face identification: see if you can build a classifier to identify individuals in the 
training set. 


(e) (Test 1) face classification: Consider the various faces and see if you can 
build a classifier that can reasonably identify an individual face. 


(f) (Test 2) gender classification: Can you build an algorithm capable of
distinguishing men from women?


(g) (Test 3) unsupervised algorithms: In an unsupervised way, can you develop
algorithms that automatically find patterns in the faces that naturally
cluster?


(Note: You can use any (and hopefully all) of the different clustering and
classification methods discussed. Be sure to compare them against each other in
these tasks.)


Chapter 6


Neural Networks and Deep Learning 


Neural networks (NNs) were inspired by the Nobel Prize-winning work of
Hubel and Wiesel on the primary visual cortex of cats [342]. Their seminal 
experiments showed that neuronal networks were organized in hierarchical 
layers of cells for processing visual stimuli. The first mathematical model of 
the NN, termed the Neocognitron in 1980 [260], had many of the character- 
istic features of today’s deep convolutional neural networks (DCNNs), includ- 
ing a multi-layer structure, convolution, max pooling, and nonlinear dynamical 
nodes. The recent success of DCNNs in computer vision has been enabled by 
two critical components: (i) the continued growth of computational power, and 
(ii) exceptionally large labeled data sets which take advantage of the power of a 
deep multi-layer architecture. Indeed, although the theoretical inception of NNs 
has an almost four-decade history, the analysis of the ImageNet data set in 2012 
provided a watershed moment for NNs and deep learning [432]. Prior to 
this data set, there were a number of data sets available with approximately 
tens of thousands of labeled images. ImageNet provided over 15 million la- 
beled, high-resolution images with over 22 000 categories. DCNNs, which are 
only one potential category of NNs, have since transformed the field of com- 
puter vision by dominating the performance metrics in almost every meaning- 
ful computer vision task intended for classification and identification. 
Although ImageNet has been critically enabling for the field, NNs were 
textbook material in the early 1990s, with a focus typically on a small number 
of layers. Critical machine learning tasks such as principal component analy- 
sis (PCA) were shown to be intimately connected with networks that included 
backpropagation. Importantly, there were a number of critical innovations which 
established multi-layer feedforward networks as a class of universal approxi- 
mators [338]. The past decade has seen tremendous advances in NN architec- 
tures, many designed and tailored for specific application areas. Innovations 
have come from algorithmic modifications that have led to significant per- 
formance gains in a variety of fields. These innovations include pre-training, 
dropout, inception modules, data augmentation with virtual examples, batch
normalization, and/or residual learning (see Goodfellow et al. for a detailed
exposition of NNs). This is only a partial list of potential algorithmic
innovations, thus highlighting the continuing and rapid pace of progress in the 
field. Remarkably, NNs were not even listed as one of the top 10 algorithms of 
data mining in 2008 [764]. But a decade later, the undeniable and growing list 
of successes of NNs on challenge data sets make them perhaps the most impor- 
tant data-mining tool for our emerging generation of scientists and engineers. 

As already shown in the last two chapters, all of machine learning revolves 
fundamentally around optimization. NNs specifically optimize over a compo- 
sitional function 


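\[
\underset{\mathbf{A}_j}{\operatorname{argmin}} \left( f_M(\mathbf{A}_M, \ldots, f_2(\mathbf{A}_2, f_1(\mathbf{A}_1, \mathbf{x})) \cdots) + \lambda g(\mathbf{A}_j) \right), \tag{6.1}
\]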


which is often solved using stochastic gradient descent and backpropagation
algorithms. Each matrix $\mathbf{A}_k$ denotes the weights connecting the neural network
from the $k$th to the $(k+1)$th layer. It is a massively under-determined system
which is regularized by $g(\mathbf{A}_j)$. Composition and regularization are critical for
generating expressive representations of the data and preventing overfitting,
respectively. The notation used in (6.1) is motivated from solving linear systems
$\mathbf{A}\mathbf{x} = \mathbf{b}$ through regression. This will be highlighted in the first few sections of
this chapter. We will then move to a broader framework of mapping input data
$\mathbf{X}$ to output data $\mathbf{Y}$ using a model $f(\cdot)$. Thus we will represent (6.1) in deep
learning models as

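\[
\underset{\boldsymbol{\theta}}{\operatorname{argmin}} \, f_{\boldsymbol{\theta}}(\mathbf{x}), \tag{6.2}
\]

where $\boldsymbol{\theta}$ are the neural network weights and $f(\cdot)$ characterizes the network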
(number of layers, structure, regularizers). Thus we will move to this nota- 
tion in the second half of this chapter as a generic representation of a neural 
net. This general optimization framework is at the center of deep learning al- 
gorithms, and its solution will be considered in this chapter. Importantly, NNs 
have significant potential for overfitting of data so that cross-validation must 
be carefully considered. Recall that: if you do not cross-validate, you is dumb. 


6.1 Neural Networks: Single-Layer Networks 


The generic architecture of a multi-layer NN is shown in Fig. 6.1. For classification
tasks, the goal of the NN is to map a set of input data to a classification.
Specifically, we train the NN to accurately map the data $\mathbf{x}_j$ to their correct
label $\mathbf{y}_j$. As shown in Fig. 6.1, the input space has the dimension of the raw data
$\mathbf{x}_j \in \mathbb{R}^n$. The output layer has the dimension of the designed classification
space. Constructing the output layer will be discussed further in the following.


Figure 6.1: Illustration of a neural net architecture mapping an input layer $\mathbf{x}$ to
an output layer $\mathbf{y}$. The middle (hidden) layers are denoted $\mathbf{x}^{(j)}$, where $j$
determines their sequential ordering. The matrices $\mathbf{A}_j$ contain the coefficients that
map each variable from one layer to the next. Although the dimensionality of
the input layer $\mathbf{x} \in \mathbb{R}^n$ is known, there is great flexibility in choosing the
dimension of the inner layers as well as how to structure the output layer. The
number of layers and how to map between layers is also selected by the user.
This flexible architecture gives great freedom in building a good classifier.


Immediately, one can see that there are a great number of design questions 
regarding NNs. How many layers should be used? What should be the dimen- 
sion of the layers? How should the output layer be designed? Should one use 
all-to-all or sparsified connections between layers? How should the mapping 
between layers be performed: a linear mapping or a nonlinear mapping? Much 
like the tuning options on SVM and classification trees, NNs have a significant 
number of design options that can be tuned to improve performance. 

Initially, we consider the mapping between layers of Fig. 6.1. We denote the
various layers between input and output as $\mathbf{x}^{(k)}$, where $k$ is the layer number.
For a linear mapping between layers, the following relations hold:


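\begin{align}
\mathbf{x}^{(1)} &= \mathbf{A}_1 \mathbf{x}, \tag{6.3a} \\
\mathbf{x}^{(2)} &= \mathbf{A}_2 \mathbf{x}^{(1)}, \tag{6.3b} \\
\mathbf{y} &= \mathbf{A}_3 \mathbf{x}^{(2)}. \tag{6.3c}
\end{align}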


This forms a compositional structure so that the mapping between input and
output can be represented as

\[
\mathbf{y} = \mathbf{A}_3 \mathbf{A}_2 \mathbf{A}_1 \mathbf{x}. \tag{6.4}
\]


This basic architecture can scale to M layers, so that a general representation 
between input data and the output layer for a linear NN is given by 


\[
\mathbf{y} = \mathbf{A}_M \mathbf{A}_{M-1} \cdots \mathbf{A}_2 \mathbf{A}_1 \mathbf{x}. \tag{6.5}
\]


This is generally a highly under-determined system that requires some con- 
straints on the solution in order to select a unique solution. One constraint is 
immediately obvious: the mapping must generate M distinct matrices that give 
the best mapping. It should be noted that linear mappings, even with a com- 
positional structure, can only produce a limited range of functional responses 
due to the limitations of the linearity. 
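
The limitation of a purely linear composition can be checked numerically: applying three linear layers in sequence gives the same output as applying their single product matrix. The layer sizes and random weights below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h1, h2, m = 6, 5, 4, 2                 # input, two hidden, and output dimensions

A1 = rng.standard_normal((h1, n))
A2 = rng.standard_normal((h2, h1))
A3 = rng.standard_normal((m, h2))
x = rng.standard_normal(n)

# Layer-by-layer evaluation, as in (6.3).
x1 = A1 @ x
x2 = A2 @ x1
y_layers = A3 @ x2

# The same map collapsed into one matrix, as in (6.4).
A_single = A3 @ A2 @ A1
y_single = A_single @ x

print(np.allclose(y_layers, y_single))
```

However many linear layers are stacked, the network can represent nothing beyond a single m-by-n matrix acting on the input.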

Nonlinear mappings are also possible, and generally used, in constructing 
the NN. Indeed, nonlinear activation functions allow for a richer set of func- 
tional responses than their linear counterparts. In this case, the connections be- 
tween layers are given by 


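\begin{align}
\mathbf{x}^{(1)} &= f_1(\mathbf{A}_1, \mathbf{x}), \tag{6.6a} \\
\mathbf{x}^{(2)} &= f_2(\mathbf{A}_2, \mathbf{x}^{(1)}), \tag{6.6b} \\
\mathbf{y} &= f_3(\mathbf{A}_3, \mathbf{x}^{(2)}). \tag{6.6c}
\end{align}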


Note that we have used different nonlinear functions $f_j(\cdot)$ between layers.
Often a single function is used; however, there is no constraint that this is
necessary. In terms of mapping the data between input and output over $M$ layers,
the following is derived:


\[
\mathbf{y} = f_M(\mathbf{A}_M, \ldots, f_2(\mathbf{A}_2, f_1(\mathbf{A}_1, \mathbf{x})) \cdots), \tag{6.7}
\]


which can be compared with (6.1) for the general optimization which constructs
the NN. As a highly under-determined system, constraints should be
imposed in order to extract a desired solution type, as in (6.1). For big data
applications such as ImageNet and computer vision tasks, the optimization
associated with this compositional framework is expensive given the number
of variables that must be determined. However, for moderate-sized networks,
it can be performed on workstation and laptop computers. Modern stochastic
gradient descent and backpropagation algorithms enable this optimization,
and both are covered in later sections.
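
The composition (6.7) amounts to a short loop over layers. The sketch below assumes the common form $f_j(\mathbf{A}_j, \mathbf{x}) = f_j(\mathbf{A}_j \mathbf{x})$, where the activation acts elementwise on the matrix-vector product; the function name forward and the random weights are hypothetical.

```python
import numpy as np

def forward(x, weights, activations):
    """Evaluate y = f_M(A_M ... f_2(A_2 f_1(A_1 x)) ...) layer by layer."""
    for A, f in zip(weights, activations):
        x = f(A @ x)
    return x

def identity(z):
    return z

rng = np.random.default_rng(1)
weights = [rng.standard_normal((5, 6)),
           rng.standard_normal((4, 5)),
           rng.standard_normal((2, 4))]
x = rng.standard_normal(6)

y_linear = forward(x, weights, [identity, identity, identity])
y_tanh = forward(x, weights, [np.tanh, np.tanh, identity])
print(y_linear, y_tanh)
```

With identity activations the loop reduces to the linear product (6.5); with tanh in the hidden layers the output no longer scales linearly with the input.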


A Single-Layer Network 


To gain insight into how an NN might be constructed, we will consider a
single-layer network that is optimized to build a classifier between dogs and cats.


Figure 6.2: Single-layer network for binary classification between dogs and cats.
The output layer for this case is a perceptron with $y \in \{\pm 1\}$. A linear mapping
between the input image space and output layer can be constructed for training
data by solving $\mathbf{A} = \mathbf{Y}\mathbf{X}^{\dagger}$. This gives a least-squares regression for the matrix
$\mathbf{A}$ mapping the images to label space.


The dogs and cats example was considered extensively in the previous chapter.
Recall that we were given images of dogs and cats, or a wavelet version of dogs
and cats. Figure 6.2 shows our construction. To make this as simple as possible,
we consider the simple NN output


\[
y = \{\text{dog}, \text{cat}\} = \{+1, -1\}, \tag{6.8}
\]


which labels each data vector with an output $y \in \{\pm 1\}$. In this case the output
layer is a single node. As in previous supervised learning algorithms, the goal
is to determine a mapping so that each data vector $\mathbf{x}_j$ is labeled correctly by $y_j$.

The easiest mapping is a linear mapping between the input images $\mathbf{x}_j \in \mathbb{R}^n$
and the output layer. This gives a linear system $\mathbf{A}\mathbf{X} = \mathbf{Y}$ of the form


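\[
\mathbf{A}\mathbf{X} = \mathbf{Y} \quad \Longrightarrow \quad
\begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}
\begin{bmatrix} | & | & & | \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \\ | & | & & | \end{bmatrix}
= \begin{bmatrix} +1 & +1 & \cdots & -1 & -1 \end{bmatrix}, \tag{6.9}
\]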


where each column of the matrix $\mathbf{X}$ is a dog or a cat image and the columns of $\mathbf{Y}$
are its corresponding labels. Since the output layer is a single node, both $\mathbf{A}$ and
$\mathbf{Y}$ reduce to vectors. In this case, our goal is to determine the matrix (vector) $\mathbf{A}$
with components $a_j$. The simplest solution is to take the pseudo-inverse of the
data matrix $\mathbf{X}$:

\[
\mathbf{A} = \mathbf{Y}\mathbf{X}^{\dagger}. \tag{6.10}
\]

Thus a single output layer allows us to build a NN using least-squares fitting.
Of course, we could also solve this linear system in a variety of other ways,
including with sparsity-promoting methods. The following code solves this
problem through both least-squares fitting (pinv) and the LASSO.


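Code 6.1: [MATLAB] Single-layer, linear neural network.


% dog_wave, cat_wave: wavelet-transformed images, one image per column
train=[dog_wave(:,1:60) cat_wave(:,1:60)];     % first 60 of each for training
test=[dog_wave(:,61:80) cat_wave(:,61:80)];    % last 20 of each withheld
label=[ones(60,1); -1*ones(60,1)].';           % +1 = dog, -1 = cat

A=label*pinv(train); test_labels=sign(A*test); % least-squares solution

A=lasso(train.',label.','Lambda',0.1).';       % sparse LASSO solution
test_labels=sign(A*test);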


Code 6.1: [Python] Single-layer, linear neural network.


import numpy as np
from sklearn import linear_model

# dog_wave, cat_wave: wavelet-transformed images, one image per column
train = np.concatenate((dog_wave[:,:60],cat_wave[:,:60]),axis=1)
test = np.concatenate((dog_wave[:,60:80],cat_wave[:,60:80]),axis=1)
label = np.repeat(np.array([1,-1]),60)     # +1 = dog, -1 = cat

A = label @ np.linalg.pinv(train)          # least-squares solution
test_labels = np.sign(A@test)

lasso = linear_model.Lasso().fit(train.T,label)   # sparse LASSO solution
A_lasso = lasso.coef_
test_labels_lasso = np.sign(A_lasso@test)


Figures 6.3 and 6.4 show the results of this linear single-layer NN with
single-node output layer. Specifically, the four rows of Fig. 6.3 show the output
layer on the withheld test data for both the pseudo-inverse and LASSO methods
along with a bar graph of the 32 × 32 (1024 pixels) weightings of the matrix
$\mathbf{A}$. Note that all matrix elements are non-zero in the pseudo-inverse solution,
while the LASSO highlights a small number of pixels that can classify the
pictures as well as using all pixels. Figure 6.4 shows the matrix $\mathbf{A}$ for the two
solution strategies reshaped into 32 × 32 images. Note that, for the pseudo-inverse,
the weightings of the matrix elements $\mathbf{A}$ show many features of the cat and dog
faces. For the LASSO method, only a few pixels are required that are clustered
near the eyes and ears. Thus for this single-layer network, interpretable results
are achieved by looking at the weights generated in the matrix $\mathbf{A}$.


Figure 6.3: Classification of withheld data tested on a trained, single-layer
network with linear mapping between inputs (pixel space) and a single output.
Panels (a) and (c) are the bar graphs of the output layer score $y \in \{\pm 1\}$ achieved
for the withheld data using a pseudo-inverse for training and the LASSO for
training, respectively. The results show in both cases that dogs are misclassified
more often than cats. Panels (b) and (d) show the coefficients
of the matrix $\mathbf{A}$ for the pseudo-inverse and LASSO, respectively. Note that the
LASSO has only a small number of non-zero elements, thus suggesting that the
NN is highly sparse.


Figure 6.4: Weightings of the matrix $\mathbf{A}$ reshaped into 32 × 32 arrays. The left
matrix shows the matrix $\mathbf{A}$ computed by least-squares regression (the
pseudo-inverse) while the right matrix shows the matrix $\mathbf{A}$ computed by LASSO. Both
matrices provide similar classification scores on withheld data. They further
provide interpretability in the sense that the results from the pseudo-inverse
show many of the features of dogs and cats while the LASSO shows that
measuring near the eyes and ears alone can give the features required for
distinguishing between dogs and cats.


6.2 Multi-Layer Networks and Activation Functions


The previous section constructed what is perhaps the simplest NN possible. It
was linear, had a single layer, and a single output layer neuron. The potential
generalizations are endless, but we will focus on two simple extensions of the
NN in this section. The first extension concerns the assumption of linearity, in
which we assumed that there is a linear transform from the image space to the
output layer: $\mathbf{A}\mathbf{x} = \mathbf{y}$ in (6.9). We highlight here common nonlinear
transformations from input to output space represented by

\[
\mathbf{y} = f(\mathbf{A}, \mathbf{x}), \tag{6.11}
\]

where $f(\cdot)$ is a specified activation function (transfer function) for our mapping.
The linear mapping used previously, although simple, does not offer the
flexibility and performance that other mappings offer. Some standard activation
functions are given by

\begin{align}
f(x) &= x && \text{linear}, \tag{6.12a} \\
f(x) &= \begin{cases} 0 & \text{for } x \le 0, \\ 1 & \text{for } x > 0, \end{cases} && \text{binary step}, \tag{6.12b} \\
f(x) &= \frac{1}{1 + \exp(-x)} && \text{logistic (soft step)}, \tag{6.12c} \\
f(x) &= \tanh(x) && \text{tanh}, \tag{6.12d} \\
f(x) &= \begin{cases} 0 & \text{for } x \le 0, \\ x & \text{for } x > 0, \end{cases} && \text{rectified linear unit (ReLU)}. \tag{6.12e}
\end{align}

There are other possibilities, but these are perhaps the most commonly consid- 
ered in practice and they will serve for our purposes. Importantly, the chosen 
function f(x) will be differentiated in order to be used in gradient descent algo- 
rithms for optimization. Each of the functions above is either differentiable or 
piecewise differentiable. Perhaps the most commonly used activation function 
is currently the ReLU, which we denote $f(x) = \mathrm{ReLU}(x)$.
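
Since both the functions and their derivatives are needed for gradient descent, it can help to see them side by side. The following Python sketch writes out the activation functions of (6.12) together with the derivatives that are defined away from the kinks; the function names are our own, and the derivative of the ReLU at zero is set to 0 by convention here.

```python
import numpy as np

def linear(x):
    return x

def binary_step(x):
    return np.where(x > 0, 1.0, 0.0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# Derivatives used by gradient-based optimization (tanh is np.tanh itself).
def d_linear(x):
    return np.ones_like(x)

def d_logistic(x):
    s = logistic(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)   # piecewise derivative; 0 chosen at x = 0

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), d_relu(x))
print(logistic(x), d_logistic(x))
```

The binary step has zero derivative wherever it is differentiable, which is one reason smooth or piecewise-linear alternatives are preferred when training by gradient descent.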

With a nonlinear activation function $f(x)$, or if there is more than one layer,
standard linear optimization routines such as the pseudo-inverse and LASSO
can no longer be used. Although this may not seem immediately significant,
recall that we are optimizing in a high-dimensional space where each entry
of the matrix $\mathbf{A}$ needs to be found through optimization. Even small to
moderate-sized problems can be computationally expensive to solve without using
specialty optimization methods. Fortunately, the two dominant optimization
components for training NNs, stochastic gradient descent (SGD) and
backpropagation (backprop), are included with the neural network function calls in
MATLAB. As these methods are critically enabling, both of them are considered in
detail in the next two sections of this chapter.

Multiple layers can also be considered, as shown in (6.6c). In this
case, the optimization must simultaneously identify multiple connectivity
matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M$, in contrast to the linear case, where only a single matrix
is determined, $\mathbf{A} = \mathbf{A}_M \cdots \mathbf{A}_2 \mathbf{A}_1$. The multiple-layer structure significantly
increases the size of the optimization problem, as each matrix element of the $M$
matrices must be determined. Even for a single-layer structure, an optimization
routine such as fminsearch will be severely challenged when considering
a nonlinear transfer function, and one needs to move to a gradient descent-based
algorithm.

MATLAB’s neural network toolbox, much like TensorFlow in Python, has 
a wide range of features, which makes it exceptionally powerful and conve- 
nient for building NNs. In the following code, we will train a NN to classify 
between dogs and cats as in the previous example. However, in this case, we 
allow the single layer to have a nonlinear transfer function that maps the input 
to the output layer. The output layer for this example will be modified to the 
following: 


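\[
\mathbf{y} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \{\text{dog}\} \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \{\text{cat}\}. \tag{6.13}
\]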


Half of the data is extracted for training, while the other half is used for testing 
the results. The following code builds a network using the train command to 
classify between our images. 


Code 6.2: [MATLAB] Neural network with nonlinear transfer functions.


% x: training images as columns; label: 2-by-N targets as in (6.13)
net = patternnet(2,'trainscg');
net.layers{1}.transferFcn = 'tansig';

net = train(net,x,label);


In the code above, the patternnet command builds a classification network; its
first argument sets the size of the hidden layer (here 2), and the two outputs of
(6.13) follow from the label matrix passed to train. It also optimizes with the
option trainscg, which is scaled conjugate gradient backpropagation. The
net.layers field also allows us to specify the transfer function, in this case
hyperbolic tangent functions (6.12d). The view(net) command produces a
diagnostic tool shown in Fig. 6.5 that summarizes the optimization and NN.

Figure 6.5: MATLAB neural network visualization tool. The number of iterations
along with the performance can all be accessed from the interactive graphical
tool. The performance, error histogram, and confusion buttons produce
Figs. 6.7 to 6.9, respectively.


The results of the classification for a cross-validated training set as well as a
withheld set are shown in Fig. 6.6. Specifically, the desired outputs are given by
the vectors (6.13). For both the training and withheld sets, the two components
of the vector are shown for the 80 training images (40 cats and 40 dogs) and the
80 withheld images (40 cats and 40 dogs). The training set produces a perfect
classifier using a single-layer network with a hyperbolic tangent transfer
function (6.12d). On the withheld data, it incorrectly identifies six of 40 dogs and
cats, yielding an accuracy of ≈85% on new data.


Figure 6.6: Comparison of the output vectors $\mathbf{y} = [y_1 \; y_2]^T$, which are ideally
(6.13) for the dogs and cats considered here. The NN training stage produces a
cross-validated classifier that achieves 100% accuracy in classifying the training
data (top two panels for 40 dogs and 40 cats). When applied to a withheld set,
85% accuracy is achieved (bottom two panels for 40 dogs and 40 cats).

The diagnostic tool shown in Fig. 6.5 allows access to a number of features
critical for evaluating the NN. Figure 6.7 is a summary of the performance
achieved by the NN training tool. In this figure, the training algorithm
automatically breaks the data into a training, validation, and test set. The
backpropagation-enabled, stochastic gradient descent optimization algorithm then
iterates through a number of training epochs until the cross-validated error
achieves a minimum. In this case, 22 epochs are sufficient to achieve a minimum.
The error on the test set is significantly higher than what is achieved for
cross-validation. For this case, only a limited amount of data is used for
training (40 dogs and 40 cats), thus making it difficult to achieve great
performance. Regardless, as already shown, once the algorithm has been trained,
it can be used to evaluate new data as shown in Fig. 6.6.

[Plot title: Best Validation Performance is 0.0041992 at epoch 22]

Figure 6.7: Summary of training of the NN over a number of epochs. The NN
architecture automatically separates the data into training, validation, and test
sets. The training continues (with a maximum of 1000 epochs) until the
validation error curve hits a minimum. The training then stops and the trained
algorithm is then used on the test set to evaluate performance. The NN trained
here has only a limited amount of data (40 dogs and 40 cats), thus limiting the
performance. This figure is accessed with the performance button on the NN
interactive tool of Fig. 6.5.


There are two other features easily available with the NN diagnostic tool of
Fig. 6.5. Figure 6.8 shows an error histogram associated with the trained
network. As with Fig. 6.7, the data is divided into training, validation, and test
sets. This provides an overall assessment of the classification quality that can
be achieved by the NN training algorithm. Another view of the performance
can be seen in the confusion matrices for the training, validation, and test data.
This is shown in Fig. 6.9. Overall, between Figs. 6.7 to 6.9, high-quality
diagnostic tools are available to evaluate how well the NN is able to achieve its
classification task. The performance limits are easily seen in these figures.


Figure 6.8: Summary of the error performance of the NN architecture for
training, validation, and test sets. This figure is accessed with the errorhistogram
button on the NN interactive tool of Fig. 6.5.


6.3 The Backpropagation Algorithm


As was shown for the NNs of the last two sections, training data is required to
determine the weights of the network. Specifically, the network weights are
determined so as to best classify dog versus cat images. In the single-layer
network, this was done using both least-squares regression and LASSO. This
shows that, at its core, an optimization routine and objective function are
required to determine the weights. The objective function should minimize a
measure of the misclassified images. The optimization, however, can be
modified by imposing a regularizer or constraints, such as the $\ell_1$ penalization in
LASSO.

In practice, the objective function chosen for optimization is not the true 
objective function desired, but rather a proxy for it. Proxies are chosen largely 
due to the ability to differentiate the objective function in a computationally 
tractable manner. There are also many different objective functions for differ- 
ent tasks. Instead, one often considers a suitably chosen loss function so as to 
approximate the true objective. Ultimately, computational tractability is critical 
for training NNs. 

Figure 6.9: Summary of the error performance through confusion matrices of
the NN architecture for training, validation, and test sets. This figure is accessed
with the confusion button on the NN interactive tool of Fig. 6.5.


The backpropagation algorithm (backprop) exploits the compositional nature
of NNs in order to frame an optimization problem for determining the
weights of the network. Specifically, it produces a formulation amenable to
standard gradient descent optimization (see Section 4.2): backprop
calculates the gradient of the error, which is then used for gradient descent.
Backprop relies on a simple mathematical principle: the chain rule for
differentiation. Moreover, it can be proven that the computational time required to
evaluate the gradient is within a factor of 5 of the time required for computing
the actual function itself [59]. This is known as the Baur-Strassen theorem.
Figure 6.10 gives the simplest example of backprop and how the gradient
descent is to be performed. The input-to-output relationship for this single-node,
one-hidden-layer network is given by

\[
y = g(z, b) = g(f(x, a), b). \tag{6.14}
\]

Thus, given functions $f(\cdot)$ and $g(\cdot)$ with weighting constants $a$ and $b$, the output
error produced by the network can be computed against the ground truth as

\[
E = \frac{1}{2}(y_0 - y)^2, \tag{6.15}
\]

where $y_0$ is the correct output and $y$ is the NN approximation to the output.
The goal is to find $a$ and $b$ to minimize the error. The minimization requires

\[
\frac{\partial E}{\partial a} = -(y_0 - y)\frac{dy}{dz}\frac{dz}{da} = 0. \tag{6.16}
\]


Figure 6.10: Illustration of the backpropagation algorithm on a one-node,
one-hidden-layer network. The compositional nature of the network gives the
input-output relationship $y = g(z, b) = g(f(x, a), b)$. By minimizing the error
between the output $y$ and its desired output $y_0$, the composition along with the
chain rule produces an explicit formula (6.16) for updating the values of the
weights. Note that the chain rule backpropagates the error all the way through
the network. Thus, by minimizing the output error, the chain rule acts on the
compositional function to produce a product of derivative terms that advance
backward through the network.


A critical observation is that the compositional nature of the network along
with the chain rule forces the optimization to backpropagate error through the
network. In particular, the terms (dy/dz)(dz/da) show how this backprop
occurs. Given functions $f(\cdot)$ and $g(\cdot)$, the chain rule can be explicitly computed.
Backprop results in an iterative, gradient descent update rule:


$$a_{k+1} = a_k - \delta \frac{\partial E}{\partial a_k}, \tag{6.17a}$$
$$b_{k+1} = b_k - \delta \frac{\partial E}{\partial b_k}, \tag{6.17b}$$


where $\delta$ is the so-called learning rate and $\partial E/\partial a$ along with $\partial E/\partial b$ can be
explicitly computed using (6.16). The iteration algorithm is executed to
convergence. As with all iterative optimization, a good initial guess is critical to
achieve a good solution in a reasonable amount of computational time.
Backprop proceeds as follows: (i) A NN is specified along with a labeled
training set. (ii) The initial weights of the network are set to random values.
Importantly, one must not initialize the weights to zero, as may be done in
other machine learning algorithms. If the weights are initialized to zero,
then after each update the outgoing weights of each neuron will be identical,
because the gradients will be identical. Moreover, NNs often get stuck at local
optima where the gradient is zero but that are not global minima, so random
weight initialization gives one a chance of circumventing this by starting
from many different random values. (iii) The training data is run through the
network to produce an output y, whose ideal ground-truth output is $y_0$. The
derivatives with respect to each network weight are then computed using the
backprop formulas (6.16). (iv) For a given learning rate $\delta$, the network weights are
updated as in (6.17). (v) We return to step (iii) and continue iterating until a
maximum number of iterations is reached or convergence is achieved.
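
The warning in step (ii) can be checked numerically. The following is a minimal sketch, assuming a hypothetical network with two hidden nodes, y = b1 tanh(a1 x) + b2 tanh(a2 x); the function name grads and all numerical values are illustrative choices and do not come from the text.

```python
import math

def grads(x, y0, a, b):
    """Gradients of E = 0.5*(y0 - y)**2 for y = sum_i b[i]*tanh(a[i]*x).

    The two-node tanh network is an illustrative assumption.
    """
    z = [math.tanh(ai * x) for ai in a]
    y = sum(bi * zi for bi, zi in zip(b, z))
    err = y0 - y
    dE_da = [-err * bi * (1.0 - zi**2) * x for bi, zi in zip(b, z)]
    dE_db = [-err * zi for zi in z]
    return dE_da, dE_db

# Zero initialization: for this network every gradient vanishes, so no update occurs.
ga_zero, gb_zero = grads(x=1.0, y0=2.0, a=[0.0, 0.0], b=[0.0, 0.0])

# Identical initial weights: both hidden nodes receive identical gradients,
# so they remain copies of each other after every update.
ga_same, gb_same = grads(x=1.0, y0=2.0, a=[0.3, 0.3], b=[0.5, 0.5])

# Distinct (e.g., random) initial weights break the symmetry.
ga_diff, gb_diff = grads(x=1.0, y0=2.0, a=[0.3, -0.8], b=[0.5, 0.1])
```

In this sketch the symmetric initializations leave the two hidden nodes indistinguishable, which is the behavior that random initialization is meant to avoid.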
As a simple example, consider the linear activation function 


$$f(\xi, \alpha) = g(\xi, \alpha) = \alpha\xi. \tag{6.18}$$
In this case we have in Fig. 6.10


$$z = ax, \tag{6.19a}$$
$$y = bz. \tag{6.19b}$$


We can now explicitly compute the gradients such as (6.16). This gives 


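$$\frac{\partial E}{\partial a} = -(y_0 - y)\frac{dy}{dz}\frac{dz}{da} = -(y_0 - y) \cdot b \cdot x, \tag{6.20a}$$
$$\frac{\partial E}{\partial b} = -(y_0 - y)\frac{dy}{db} = -(y_0 - y)\, z = -(y_0 - y) \cdot a \cdot x. \tag{6.20b}$$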


Thus, with the current values of a and b, along with the input-output pair x
and y and target truth $y_0$, each derivative can be evaluated. This provides the
required information to perform the update (6.17). 
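
As a minimal sketch of the update rule (6.17) with the gradients (6.20), the following pure-Python loop trains the single-node linear network z = ax, y = bz on one input-output pair. The function name train_single_node, the initial weights, the learning rate, the stopping tolerance, and the training pair are all illustrative assumptions rather than values from the text.

```python
def train_single_node(x, y0, a=0.5, b=0.5, delta=0.1, max_iter=1000, tol=1e-12):
    """Gradient descent (6.17) on E = 0.5*(y0 - y)**2 for z = a*x, y = b*z.

    All default values are illustrative assumptions.
    """
    for k in range(max_iter):
        z = a * x              # hidden node, (6.19a)
        y = b * z              # network output, (6.19b)
        err = y0 - y
        if 0.5 * err**2 < tol:
            break
        dE_da = -err * b * x   # (6.20a): -(y0 - y) * dy/dz * dz/da
        dE_db = -err * z       # (6.20b): -(y0 - y) * dy/db
        # Update both weights simultaneously using the current gradients
        a, b = a - delta * dE_da, b - delta * dE_db
    return a, b, k

a, b, iters = train_single_node(x=1.0, y0=2.0)
print(a * b)  # the product a*b approaches y0/x
```

Only the product ab is determined by the data in this example, so different initial guesses generally converge to different pairs (a, b) with the same output.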

The backprop for a deeper net follows in a similar fashion. Consider a
network with M hidden layers labeled $z_1$ to $z_M$, with the first connection weight a
between x and $z_1$. The generalization of Fig. 6.10 and (6.16) is given by


$$\frac{\partial E}{\partial a} = -(y_0 - y)\frac{dy}{dz_M}\frac{dz_M}{dz_{M-1}} \cdots \frac{dz_2}{dz_1}\frac{dz_1}{da}. \tag{6.21}$$


The cascade of derivatives induced by the composition and chain rule high- 
lights the backpropagation of errors that occurs when minimizing the classifi- 
cation error. 
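
The product of derivatives in (6.21) can be accumulated by walking backward from the output. The sketch below assumes a chain of single nodes with a tanh activation at every node, including the output; the activation choice, the function names forward and grad_first_weight, and the numerical values are illustrative assumptions. A centered finite difference is used as an independent check.

```python
import math

def forward(x, weights):
    """Return the node values [x, z_1, ..., z_M, y] for a chain of single nodes."""
    zs = [x]
    for w in weights:
        zs.append(math.tanh(w * zs[-1]))
    return zs

def grad_first_weight(x, y0, weights):
    """dE/da for the first weight a = weights[0], following (6.21)."""
    zs = forward(x, weights)
    y = zs[-1]
    grad = -(y0 - y)                           # leading factor -(y0 - y)
    # Multiply the local derivatives while moving from the output toward the input.
    for j in range(len(weights) - 1, 0, -1):
        grad *= (1.0 - zs[j + 1] ** 2) * weights[j]   # d(node j+1)/d(node j)
    grad *= (1.0 - zs[1] ** 2) * x             # final factor dz_1/da
    return grad

def error(x, y0, weights):
    return 0.5 * (y0 - forward(x, weights)[-1]) ** 2

ws = [0.7, -1.2, 0.9, 1.1]   # illustrative weights; the last one feeds the output
x, y0, h = 0.5, 0.3, 1e-6
g_backprop = grad_first_weight(x, y0, ws)
g_fd = (error(x, y0, [ws[0] + h] + ws[1:]) - error(x, y0, [ws[0] - h] + ws[1:])) / (2 * h)
```

The two estimates should agree to within the finite-difference truncation and roundoff error.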

A full generalization of backprop involves multiple layers as well as multiple
nodes per layer. The general situation is illustrated in Fig. 6.1. The objective is to
determine the matrix elements of each matrix $\mathbf{A}_j$. Thus a significant number of
network parameters need to be updated in gradient descent. Indeed, training a
network can often be computationally infeasible even though the update rules
for individual weights are not difficult. NNs can thus suffer from the curse of
dimensionality, as each matrix from one layer to another requires updating $n^2$
coefficients for an n-dimensional input, assuming the two connected layers are
both n-dimensional.

Denoting all the weights to be updated by the vector w, where w contains
all the elements of the matrices $\mathbf{A}_j$ illustrated in Fig. 6.1, then


$$\mathbf{w}_{k+1} = \mathbf{w}_k - \delta \nabla E, \tag{6.22}$$
where the gradient of the error $\nabla E$, through the composition and chain rule,
produces the backpropagation algorithm for updating the weights and
reducing the error. Expressed in a component-by-component way:


$$w^j_{k+1} = w^j_k - \delta \frac{\partial E}{\partial w^j_k}, \tag{6.23}$$
where this equation holds for the jth component of the vector w. The term
$\partial E/\partial w^j_k$ produces the backpropagation through the chain rule, i.e., it produces
the sequential set of functions to evaluate as in (6.21). Methods for solving 
this optimization more quickly, or even simply enabling the computation to 
be tractable, remain of active research interest. Perhaps the most important 
method is stochastic gradient descent, which is considered in the next section. 


6.4 The Stochastic Gradient Descent Algorithm 


Training neural networks is computationally expensive due to the size of the 
NNs being trained. Even NNs of modest size can become prohibitively expen- 
sive if the optimization routines used for training are not well informed. Two al- 
gorithms have been especially critical for enabling the training of NNs: stochas- 
tic gradient descent (SGD) and backprop. Backprop allows for an efficient com- 
putation of the objective function’s gradient, while SGD provides a more rapid 
evaluation of the optimal network weights. Although alternative optimization 
methods for training NNs continue to provide computational improvements, 
backprop and SGD are both considered here in detail so as to give the reader 
an idea of the core architecture for building NNs. 

Gradient descent was considered in Section 4.2. Recall that this algorithm
was developed for nonlinear regression where the data fit takes the general
form
$$f(\mathbf{x}) = f(\mathbf{x}, \boldsymbol{\beta}), \tag{6.24}$$
where $\boldsymbol{\beta}$ are fitting coefficients used to minimize the error. In NNs, the
parameters $\boldsymbol{\beta}$ are the network weights; thus we can rewrite this in the form


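$$f(\mathbf{x}) = f(\mathbf{x}, \mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M), \tag{6.25}$$


where the $\mathbf{A}_j$ are the connectivity matrices from one layer to the next in the NN.
Thus $\mathbf{A}_1$ connects the first and second layers, and there are M hidden layers.

The goal of training the NN is to minimize the error between the network
and the data. The standard root-mean-square error for this case is defined as


$$\underset{\mathbf{A}_j}{\operatorname{argmin}}\, E(\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M) = \underset{\mathbf{A}_j}{\operatorname{argmin}} \sum_{k=1}^{n} \left( f(\mathbf{x}_k, \mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M) - \mathbf{y}_k \right)^2, \tag{6.26}$$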
which can be minimized by setting the partial derivative with respect to each
matrix component to zero, i.e., we require $\partial E/\partial (a_{ij})_k = 0$, where $(a_{ij})_k$ is the
entry in the ith row and jth column of the kth matrix (k = 1, 2, ..., M). Recall that the zero
derivative is a minimum, since there is no maximum error. This gives the
gradient $\nabla f(\mathbf{x})$ of the function with respect to the NN parameters. Note further
that $f(\cdot)$ is the function evaluated at each of the n data points.

As was shown in Section 4.2, this leads to a Newton-Raphson iteration
scheme for finding the minima,


$$\mathbf{x}_{j+1}(\delta) = \mathbf{x}_j - \delta \nabla f(\mathbf{x}_j), \tag{6.27}$$


where $\delta$ is a parameter determining how far a step should be taken along the
gradient direction. In NNs, this parameter is called the learning rate. Unlike
standard gradient descent, it can be computationally prohibitive to compute 
an optimal learning rate. 

Although the optimization formulation is easily constructed, evaluating (6.26)
is often computationally intractable for NNs. This is due to two reasons: (i) the
number of matrix weighting parameters for each $\mathbf{A}_j$ is quite large, and (ii) the
number of data points n is generally also large.

To render the computation potentially tractable, SGD does not
estimate the gradient in (6.27) using all n data points. Rather, a single, randomly
chosen data point, or a subset for batch gradient descent, is used to approximate
the gradient at each step of the iteration. In this case, we can reformulate the
least-squares fitting of (6.26) so that


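$$E(\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M) = \sum_{k=1}^{n} E_k(\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M) \tag{6.28}$$
and
$$E_k(\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M) = \left( f_k(\mathbf{x}_k; \mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_M) - \mathbf{y}_k \right)^2, \tag{6.29}$$


where $f_k(\cdot)$ is now the fitting function for each data point, and the entries of the
matrices $\mathbf{A}_j$ are determined from the optimization process.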
The gradient descent iteration algorithm (6.27) is now updated as follows: 


$$\mathbf{w}_{j+1}(\delta) = \mathbf{w}_j - \delta \nabla E_k(\mathbf{w}_j), \tag{6.30}$$


where $\mathbf{w}_j$ is the vector of all the network weights from $\mathbf{A}_j$ (j = 1, 2, ..., M) at
the jth iteration, and the gradient is computed using only the kth data point
and $f_k(\cdot)$. Thus, instead of computing the gradient with all n points, only a
single data point is randomly selected and used. At the next iteration, another
randomly selected point is used to compute the gradient and update the solution.
The algorithm may require multiple passes through all the data to converge,
but each step is now easy to evaluate versus the expensive computation of the
Jacobian that is required for the gradient. If, instead of a single point, a subset
of points is used, then we have the following batch gradient descent algorithm:


$$\mathbf{w}_{j+1}(\delta) = \mathbf{w}_j - \delta \nabla E_K(\mathbf{w}_j), \tag{6.31}$$


where $K \in \{k_1, k_2, \ldots, k_p\}$ denotes the p randomly selected data points $k_j$ used
to approximate the gradient.


Figure 6.11: Stochastic gradient descent applied to the function featured in
Fig. 4.3(b). The convergence can be compared to a full gradient descent
algorithm as shown in Fig. Each step of the stochastic (batch) gradient
descent selects 100 data points for approximating the gradient, instead of the
$10^4$ data points of the data. Three initial conditions are shown: $(x_0, y_0) =
\{(0, 4), (-5, 0), (2, -5)\}$. The first of these (red circles) gets stuck in a local
minimum, while the other two initial conditions (blue and magenta) find the global
minimum. Interpolation of the gradient functions of Fig. 4.5 is used to update
the solutions.

Code from Section 4.2 can be modified for the stochastic gradient descent.
The modification here involves taking a significant subsampling of the data to
approximate the gradient. Specifically, a batch gradient descent is illustrated
with a fixed learning rate of $\delta = 2$. Ten points are used to approximate the
gradient of the function at each step. 

Figure 6.11 shows the convergence of SGD for three initial conditions. As
with gradient descent, the algorithm can get stuck in local minima. However,
the SGD now approximates the gradient with only 100 points instead of the full
$10^4$ points, thus using two orders of magnitude fewer points per gradient
evaluation. Importantly, the SGD is a scalable algorithm, allowing for significant
computational savings even as the data grows to be high-dimensional. For this 
reason, SGD has become a critically enabling part of NN training. Note that 
the learning rate, batch size, and data sampling play an important role in the 
convergence of the method. 
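
The batch update (6.31) can be sketched in a few lines of pure Python. The example below assumes a hypothetical one-parameter model f(x, w) = wx fit to synthetic noisy data; the function name minibatch_sgd, the data, the learning rate, and the batch size are illustrative assumptions and are unrelated to the example of Fig. 6.11. The batch gradient is averaged rather than summed, which only rescales the learning rate.

```python
import random

def minibatch_sgd(xs, ys, w0=0.0, delta=0.1, batch_size=10, steps=300, seed=0):
    """Batch gradient descent (6.31) for the model f(x, w) = w*x.

    E_k(w) = (w*x_k - y_k)**2, and each step uses only p = batch_size
    randomly selected points instead of all n points.
    """
    rng = random.Random(seed)
    n = len(xs)
    w = w0
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)   # indices K = {k_1, ..., k_p}
        grad = sum(2.0 * (w * xs[k] - ys[k]) * xs[k] for k in batch) / batch_size
        w -= delta * grad                          # w_{j+1} = w_j - delta * grad E_K
    return w

# Synthetic data: y = 3x plus Gaussian noise (illustrative assumption)
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

w_sgd = minibatch_sgd(xs, ys)
# Full least-squares solution using all n points, for comparison
w_full = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

Each step touches 10 of the 1000 points, and the iterate fluctuates around the full least-squares solution with a spread set by the learning rate and batch size.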
6.5 Deep Convolutional Neural Networks


With the basics of the NN architecture in hand, along with an understanding 
of how to formulate an optimization framework powered by SGD and backprop,
we are ready to construct deep convolutional neural nets (DCNNs), which are
the fundamental building blocks of deep learning methods. Indeed, today when 
practitioners generally talk about NNs for practical use, they are typically talk- 
ing about DCNNs. Of course, natural language processing (NLP) is another im- 
portant class powered by recurrent neural networks (RNNs). But as much as we 
would like to have a principled approach to building DCNNs, there remains a 
great deal of artistry and expert intuition for producing the highest-performing 
networks. Moreover, DCNNs are especially prone to overtraining, thus requir- 
ing special care to cross-validate the results. The recent textbook on deep learn- 
ing by Goodfellow et al. provides a detailed and extensive account of the 
state of the art in DCNNs. It is especially useful for highlighting many rules of 
thumb and tricks for training effective DCNNs. 

Like SVM and random forest algorithms, the MATLAB package for building 
NNs has a tremendous number of features and tuning parameters. This flexibil- 
ity is both advantageous and overwhelming at the same time. As was pointed 
out at the beginning of this chapter, it is immediately evident that there are a 
great number of design questions regarding NNs. How many layers should be 
used? What should be the dimension of the layers? How should the output 
layer be designed? Should one use all-to-all or sparsified connections between 
layers? How should the mapping between layers be performed: a linear map- 
ping or a nonlinear mapping? 

The prototypical structure of a DCNN is illustrated in Fig. 6.12. Included
in the visualization is a number of commonly used convolutional and pooling 
layers. Also illustrated is the fact that each layer can be used to build multiple 
downstream layers, or feature spaces, which can be engineered by the choice 
of activation functions and/or network parameterizations. All of these layers 
are ultimately combined into the output layer. The number of connections that 
require updating through backprop and SGD can be extraordinarily high; thus 
even modest networks and training data may require significant computational 
resources. A typical DCNN is constructed of a number of layers, with DCNNs
typically having 7-10 layers. More recent efforts have considered the
advantages of a truly deep network with approximately 100 layers, but the merits
of such architectures are still not fully known. The following paragraphs high- 
light some of the more prominent elements that comprise DCNNs, including 
convolutional layers, pooling layers, fully connected layers, and dropout. 

Figure 6.12: Prototypical DCNN architecture which includes commonly used
convolutional and pooling layers. The dark gray boxes show the convolutional 
sampling from layer to layer. Note that, for each layer, many functional trans- 
formations can be used to produce a variety of feature spaces. The network 
ultimately integrates all this information into the output layer. 


Convolutional Layers 


Convolutional layers are similar to windowed (Gabor) Fourier transforms or
wavelets from Chapter 2, in that a small selection of the full high-dimensional
input space is extracted and used for feature engineering. Figure 6.12 shows the
convolutional windows (dark gray boxes) that are slid across the entire layer
(light gray boxes). Each convolutional window transforms the data into a new
node through a given activation function, as shown in Fig. 6.12(a). The feature
spaces are thus built from the smaller patches of the data. Convolutional layers
are especially useful for images, as they can extract important features such as
edges. Wavelets are also known to efficiently extract such features, and there
are deep mathematical connections between wavelets and DCNNs, as shown
by Mallat and co-workers [18,476]. Note that in Fig. 6.12 the input layer can be
used to construct many layers by simply manipulating the activation function
$f(\cdot)$ to the next layer as well as the size of the convolutional window.
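
To make the sliding-window picture concrete, the following sketch applies a single convolutional window to a small image and passes each result through a ReLU. The function name conv2d_relu, the 3 × 3 kernel, and the test image are hypothetical; as in most deep learning libraries, the window is applied without flipping the kernel (a cross-correlation), with no padding and a stride of one.

```python
def conv2d_relu(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and apply a ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum over the patch covered by the window
            s = sum(kernel[p][q] * image[i + p][j + q]
                    for p in range(kh) for q in range(kw))
            row.append(max(0.0, s))   # ReLU activation produces the new node
        out.append(row)
    return out

# A 4 x 6 image with a vertical edge between the third and fourth columns
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
# Hypothetical kernel that responds to dark-to-bright vertical edges
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
features = conv2d_relu(image, kernel)   # 2 x 4 feature map
```

The feature map is nonzero only where the window straddles the edge, which illustrates how a small patch-based filter extracts a localized feature.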


Pooling Layers 


It is common to periodically insert a pooling layer between successive convo- 
lutional layers in a DCNN architecture. Its function is to progressively reduce 
the spatial size of the representation in order to reduce the number of param- 
eters and computation in the network. This is an effective strategy (i) to help 
control overfitting and (ii) to fit the computation in memory. Pooling layers
operate independently on every depth slice of the input and resize them spatially.
Using the max operation, i.e., the maximum value for all the nodes in its
convolutional window, is called max pooling. In image processing, the most common
form of max pooling is a pooling layer with filters of size 2 × 2 applied with
a stride of two, which downsamples every depth slice in the input by two along both
width and height, discarding 75% of the activations. Every max pooling
operation would in this case be taking a max over four numbers (a 2 × 2 region in
some depth slice). The depth dimension remains unchanged. An example max
pooling operation is shown in Fig. 6.12(b), where a 3 × 3 convolutional cell is
transformed to a single number that is the maximum of the nine numbers.
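
A minimal sketch of 2 × 2 max pooling with a stride of two on a single depth slice is given below. The function name max_pool2d and the 4 × 4 example values are illustrative assumptions.

```python
def max_pool2d(x, size=2, stride=2):
    """Max pooling over one depth slice `x`, given as a list of rows."""
    out = []
    for i in range(0, len(x) - size + 1, stride):
        row = []
        for j in range(0, len(x[0]) - size + 1, stride):
            # Maximum over the size x size window with top-left corner (i, j)
            row.append(max(x[i + p][j + q]
                           for p in range(size) for q in range(size)))
        out.append(row)
    return out

slice_4x4 = [[1, 3, 2, 0],
             [4, 2, 1, 1],
             [0, 1, 5, 6],
             [2, 2, 7, 8]]
pooled = max_pool2d(slice_4x4)   # 16 activations are reduced to 4
```

Here the 4 × 4 slice is reduced to a 2 × 2 slice, so three of every four activations are discarded, consistent with the 75% figure quoted above.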


Fully Connected Layers 


Occasionally, fully connected layers are inserted into the DCNN so that differ- 
ent regions can be connected. The pooling and convolutional layers are local 
connections only, while the fully connected layer restores global connectivity. 
This is another commonly used layer in the DCNN architecture, providing a 
potentially important feature space to improve performance. 


Dropout 


Overfitting is a serious problem in DCNNs. Indeed, overfitting is at the core
of why DCNNs often fail to demonstrate good generalizability properties (see
Chapter 4 on regression). Large DCNNs are also slow to use, making it
difficult to deal with overfitting by combining the predictions of many
different large neural nets for online implementation. Dropout is a technique which
helps address this problem. The key idea is to randomly drop nodes in the
network (along with their connections) from the DCNN during training, i.e.,
during SGD/backprop updates of the network weights. This prevents units from
co-adapting too much. During training, dropout samples from an exponential
number of different “thinned” networks. This idea is similar to the ensemble
methods for building random forests. At test time, it is easy to approximate 
the effect of averaging the predictions of all these thinned networks by simply 
using a single unthinned network that has smaller weights. This significantly 
reduces overfitting and has been shown to give major improvements over other 
regularization methods [672]. 
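
The relationship between the thinned training networks and the single scaled test network can be sketched as follows. The function names dropout_train and dropout_test, the keep probability, and the activation values are illustrative assumptions, and the "network output" is taken to be a plain sum of the node activations (a linear output node with unit weights), for which scaling the activations is equivalent to scaling the outgoing weights.

```python
import random

def dropout_train(activations, p_keep, rng):
    """Training pass: keep each node with probability p_keep, otherwise zero it."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]

def dropout_test(activations, p_keep):
    """Test pass: keep every node but scale by p_keep (the 'smaller weights')."""
    return [p_keep * a for a in activations]

rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
p_keep = 0.5
n_samples = 20000

# Average output over many randomly thinned networks
mean_thinned = sum(sum(dropout_train(acts, p_keep, rng))
                   for _ in range(n_samples)) / n_samples
# Output of the single unthinned network with scaled activations
scaled = sum(dropout_test(acts, p_keep))
```

For this linear output the scaled network reproduces the ensemble average in expectation; for nonlinear networks the scaling is only an approximation to the average.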


There are many other techniques that have been devised for training DCNNs,
but the above methods highlight some of the most commonly used. The most
successful applications of these techniques tend to be in computer vision tasks
where DCNNs offer unparalleled performance in comparison to other machine
learning methods. Importantly, the ImageNet data set is what allowed these
DCNN layers to be maximally leveraged for human-level recognition
performance.


Figure 6.13: Representative images of the alphabet characters A, B, and C. There
are a total of 1500 28 × 28 grayscale images (XTrain) of the letters that are labeled
(TTrain).


To illustrate how to train and execute a DCNN, we use data from MATLAB. 
Specifically, we use a data set that has a training and test set with the alphabet 
characters A, B, and C (see Fig. 6.13). The training data, XTrain, contains 1500
28 × 28 grayscale images of the letters A, B, and C in a four-dimensional array.
There are equal numbers of each letter in the data set. The variable TTrain con- 
tains the categorical array of the letter labels, i.e., the truth labels. The following 
code constructs and trains a DCNN. 


Code 6.3: [MATLAB] Train a DCNN. 


layers = [imageInputLayer([28 28 1]);
          convolution2dLayer(5,16);
          reluLayer();
          maxPooling2dLayer(2,'Stride',2);
          fullyConnectedLayer(3);
          softmaxLayer();
          classificationLayer()];

options = trainingOptions('sgdm');

rng('default') % For reproducibility

net = trainNetwork(XTrain,TTrain,layers,options);


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




Code 6.3: [Python] Train a DCNN. 


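# Imports assumed by this listing; XTrain, y_train (one-hot labels), and
# classes are defined when the data is loaded.
from tensorflow.keras import Sequential, optimizers
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=5, activation='relu',
                 input_shape=(28,28,1)))
model.add(MaxPool2D(pool_size=2, strides=2))
model.add(Flatten())
model.add(Dense(len(classes), activation='softmax'))

sgd_optimizer = optimizers.SGD(momentum=0.9)
model.compile(optimizer=sgd_optimizer, loss='categorical_crossentropy')
model.fit(XTrain, y_train, epochs=30)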


Note the simplicity in how diverse network layers are easily put together. 
In addition, a ReLU activation layer is specified along with the training method
of stochastic gradient descent with momentum (sgdm). The trainNetwork command integrates
the options and layer specifications to build the best classifier possible. The 
resulting trained network can now be used on a test data set. 


Code 6.4: [MATLAB] Test the DCNN performance. 
YTest = classify(net,XTest);


Code 6.4: [Python] Test the DCNN performance. 


YPredict = np.argmax(model.predict(XTest),axis=1)


The resulting classification performance is approximately 93%. One can see 
by this code structure that modifying the network architecture and specifica- 
tions is trivial. Indeed, one can probably easily engineer a network to outper- 
form the illustrated DCNN. As already mentioned, artistry and expert intuition 
are critical for producing the highest-performing networks. 


6.6 Neural Networks for Dynamical Systems 


Neural networks offer an amazingly flexible architecture for performing a di- 
verse set of mathematical tasks. To return to Mallat et al.: Supervised learning is 
a high-dimensional interpolation problem [476]. Thus, if sufficiently rich data can 
be acquired, NNs offer the ability to interrogate the data for a variety of tasks 
centered on classification and prediction. To this point, the tasks demonstrated 
have primarily been concerned with computer vision. However, NNs can also 
be used for future state predictions of dynamical systems (see Chapter 7).

To demonstrate the usefulness of NNs for applications in dynamical
systems, we will consider the Lorenz system of differential equations [460]


$$\dot{x} = \sigma(y - x), \tag{6.32a}$$
$$\dot{y} = x(\rho - z) - y, \tag{6.32b}$$
$$\dot{z} = xy - \beta z, \tag{6.32c}$$
where the state of the system is given by $\mathbf{x} = [x \;\; y \;\; z]^T$ with the parameters
$\sigma = 10$, $\rho = 28$, and $\beta = 8/3$. This system will be considered in further detail
in the next chapter. For the present, we will simulate this nonlinear system and
use it as a demonstration of how NNs can be trained to characterize dynamical
systems. Specifically, the goal of this section is to demonstrate that we can train
a NN to learn an update rule which advances the state space from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$,
where k denotes the state of the system at time $t_k$. Accurately advancing the
solution in time requires a nonlinear transfer function, since the Lorenz system 
itself is nonlinear. 

The training data required for the NN is constructed from high-accuracy 
simulations of the Lorenz system. The following code generates a diverse set 
of initial conditions. One hundred initial conditions are considered in order to 
generate 100 trajectories. The sampling time is fixed at $\Delta t = 0.01$. Note that
the sampling time is not the same as the time-steps taken by the fourth-order 
Runge-Kutta method [420]. The time-steps are adaptively chosen to meet the 
stringent tolerances of accuracy chosen for this example. 


Code 6.5: [MATLAB] Create training data of Lorenz trajectories. 
dt=0.01; T=8; t=0:dt:T;
b=8/3; sig=10; r=28;

% Right-hand side of the Lorenz equations (6.32)
Lorenz = @(t,x)([ sig * (x(2) - x(1))          ; ...
                  r * x(1) - x(1) * x(3) - x(2) ; ...
                  x(1) * x(2) - b * x(3)         ]);
ode_options = odeset('RelTol',1e-10, 'AbsTol',1e-11);

input=[]; output=[];
for j=1:100  % training trajectories
    x0=30*(rand(3,1)-0.5);
    [t,y] = ode45(Lorenz,t,x0,ode_options);  % pass the tolerances defined above
    input=[input; y(1:end-1,:)];
    output=[output; y(2:end,:)];
end


Code 6.5: [Python] Create training data of Lorenz trajectories. 


import numpy as np
from scipy import integrate

dt = 0.01; T = 8; t = np.arange(0,T+dt,dt)
beta = 8/3; sigma = 10; rho = 28

nn_input = np.zeros((100*(len(t)-1),3))
nn_output = np.zeros_like(nn_input)

def lorenz_deriv(x_y_z,t0,sigma=sigma,beta=beta,rho=rho):
    # Right-hand side of the Lorenz equations (6.32)
    x, y, z = x_y_z
    return [sigma*(y-x), x*(rho-z)-y, x*y-beta*z]

x0 = -15 + 30 * np.random.random((100, 3))
x_t = np.asarray([integrate.odeint(lorenz_deriv, x0_j, t)
                  for x0_j in x0])

# Stack the (x_k, x_{k+1}) pairs from all trajectories
for j in range(100):
    nn_input[j*(len(t)-1):(j+1)*(len(t)-1),:] = x_t[j,:-1,:]
    nn_output[j*(len(t)-1):(j+1)*(len(t)-1),:] = x_t[j,1:,:]


The simulation of the Lorenz system produces two key matrices: input and
output. The former is a matrix of the system at $\mathbf{x}_k$, while the latter is the
corresponding state of the system $\mathbf{x}_{k+1}$ advanced $\Delta t = 0.01$.

The NN must learn the nonlinear mapping from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$. Figure 6.14
shows the various trajectories used to train the NN. Note the diversity of initial
conditions and the underlying attractor of the Lorenz system.

We now build a NN trained on trajectories of Fig. 6.14 to advance the
solution $\Delta t = 0.01$ into the future for an arbitrary initial condition. Here, a three-
layer network is constructed with 10 nodes in each layer and a different acti- 
vation unit for each layer. The choice of activation types, nodes in the layer, 
and number of layers are arbitrary. It is trivial to make the network deeper and 
wider and enforce different activation units. The performance of the NN for 
the arbitrary choices made is quite remarkable and does not require additional 
tuning. The NN is built with the following few lines of code. 


Code 6.6: [MATLAB] Build a neural network for Lorenz system. 


net = feedforwardnet([10 10 10]);
net.layers{1}.transferFcn = 'logsig';
net.layers{2}.transferFcn = 'radbas';
net.layers{3}.transferFcn = 'purelin';
net = train(net,input.',output.');


Code 6.6: [Python] Build a neural network for Lorenz system. 


# Imports assumed by this listing
from tensorflow import keras
from tensorflow.keras import layers

net = keras.models.Sequential()
net.add(layers.Dense(10, input_dim=3, activation='sigmoid'))
net.add(layers.Dense(10, activation='relu'))
net.add(layers.Dense(3, activation='linear'))
net.compile(loss='mse', optimizer='adam')
History = net.fit(nn_input, nn_output, epochs=1000)

Figure 6.14: Evolution of the Lorenz dynamical equations for 100 randomly
chosen initial conditions (red circles). For the parameters $\sigma = 10$, $\rho = 28$, and
$\beta = 8/3$, all trajectories collapse to an attractor. These trajectories, generated
from a diverse set of initial data, are used to train a neural network to learn the
nonlinear mapping from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$.


The code produces a function net which can be used with a new set of data to
produce predictions of the future. Specifically, the function net gives the
nonlinear mapping from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$. Figure 6.15 shows the structure of the network
along with the performance of the training over 1000 epochs of training. The
results of the cross-validation are also demonstrated. The NN converges steadily
to a network that produces accuracies on the order of $10^{-5}$.

Figure 6.15: (a) Network architecture used to train the NN on the trajectory
data of Fig. 6.14. A three-layer network is constructed with 10 nodes in each
layer and a different activation unit for each layer. (b) Performance summary
of the NN optimization algorithm. Over 1000 epochs of training, accuracies on
the order of $10^{-5}$ are produced. The NN is also cross-validated in the process.
[Panel (b) title: Best Validation Performance is 5.9072e-06 at epoch 1000.]


Once the NN is trained on the trajectory data, the nonlinear model
mapping $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$ can be used to predict the future state of the system from an
initial condition. In the following code, the trained function net is used to take
an initial condition and advance the solution $\Delta t$. The output can be reinserted
into the net function to estimate the solution $2\Delta t$ into the future. This
iterative mapping can produce a prediction for the future state as far into the future
as desired. In what follows, the mapping is used to predict the Lorenz
solutions eight time units into the future from a given initial condition. This can
then be compared against the ground-truth simulation of the evolution using
a fourth-order Runge-Kutta method. The following iteration scheme gives the
NN approximation to the dynamics.


Code 6.7: [MATLAB] Neural network for prediction. 
ynn(1,:)=x0;
for jj=2:length(t)
    y0=net(x0);
    ynn(jj,:)=y0.'; x0=y0;
end


Code 6.7: [Python] Neural network for prediction. 


ynn = np.zeros((num_traj, len(t), 3))
ynn[:, 0, :] = -15 + 30 * np.random.random((num_traj, 3))
for jj, tval in enumerate(t[:-1]):
    ynn[:, jj+1, :] = net.predict(ynn[:, jj, :])


Figure 6.16: Comparison of the time evolution of the Lorenz system (solid line)
with the NN prediction (dotted line) for two randomly chosen initial conditions
(red dots). The NN prediction stays close to the dynamical trajectory of the
Lorenz model. A more detailed comparison is given in Fig. 6.17.


Figure 6.16 shows the evolution of two randomly drawn trajectories (solid
lines) compared against the NN prediction of the trajectories (dotted lines). The
NN prediction is remarkably accurate in producing an approximation to the
high-accuracy simulations. This shows that the data used for training is capable
of producing a high-quality nonlinear model mapping $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$. The quality
of the approximation is more clearly seen in Fig. 6.17, where the time evolution
of the individual components of x is shown against the NN predictions. See
Section 7.5 for further details.

In conclusion, the NN can be trained to learn dynamics. More precisely, the
NN seems to learn an algorithm which is approximately equivalent to a
fourth-order Runge-Kutta scheme for advancing the solution a time-step $\Delta t$. Indeed,
NNs have been used to model dynamical systems and other physical
processes for decades. However, great strides have been made recently
in using DNNs to learn Koopman embeddings, resulting in several excellent
papers [766]. For example, the VAMPnet architecture
uses a time-lagged autoencoder and a custom variational score to
identify Koopman coordinates on an impressive protein folding example. In an
alternative formulation, variational autoencoders can build low-rank models
that are efficient and compact representations of the Koopman operator from
data [465]. By construction, the resulting network is both parsimonious and
interpretable, retaining the flexibility of neural networks and the physical
interpretation of Koopman theory. In all of these recent studies, DNN
representations have been shown to be more flexible and exhibit higher accuracy than
other leading methods on challenging problems.


Figure 6.17: Comparison of the time evolution of the Lorenz system for two
randomly chosen initial conditions (also shown in Fig. 6.16). The left column
shows that the evolution of the Lorenz differential equations and the NN
mapping give identical results until $t \approx 5.5$, at which point they diverge. In contrast,
the NN prediction stays on the trajectory of the second initial condition for the
entire time window.


6.7 Recurrent Neural Networks 


Recurrent neural networks (RNNs) are an important class of neural network ar- 
chitectures that leverage sequential data streams. Sequential data is prevalent 
in speech recognition, as sentences and phrases have specific temporal struc- 
tures in order to produce output that is meaningful. RNNs are trained by re- 
specting the time history of a given sequence. Thus, unlike the standard feed- 
Figure 6.18: Recurrent neural network structure which trains on sequences of
input data $\mathbf{x}_k = \mathbf{x}(t_k)$ and output data $\mathbf{y}_k = \mathbf{y}(t_k)$. Unlike a feedforward neural
network, the time history of the sequence, or memory, is used to train the neural
network. Thus the output of the neural network is fed back into the latent
layer $\mathbf{f}_{\theta}$. The left representation of the RNN shows the recurrent structure that
is achieved from feeding the output back into the neural network. The right
representation is the unfolding of the graph, which shows a neural network $\mathbf{f}_{\theta}$
that shares weights across different time points.


forward neural networks of the last section, the time history of the sequences, 
or memory, is explicitly accounted for. Figure 6.18 shows the neural network
architecture of a generic RNN, where a sequence with m snapshots of temporal 
history is trained to learn a representation of the sequence. 

RNNs are structured around key ideas in dynamical systems. Specifically, 
we often think of dynamics in terms of a flow map 


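$$\mathbf{x}_{k+1} = \mathbf{f}(\mathbf{x}_k, \theta), \tag{6.33}$$


which advances a solution forward in time from $\mathbf{x}_k = \mathbf{x}(t_k)$ to $\mathbf{x}_{k+1} = \mathbf{x}(t_{k+1})$.
In the last section, we constructed a feedforward neural network to model the
flow map

$$\mathbf{x}_{k+1} = \mathbf{f}_{\theta}(\mathbf{x}_k), \tag{6.34}$$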


where $\theta$ are the network weights. An RNN trains over a sequence of temporal
snapshots, which have an associated output $\mathbf{y}_k$, so that

$$\mathbf{x}_{k+m} = \mathbf{f}_{\theta}(\mathbf{f}_{\theta}(\cdots \mathbf{f}_{\theta}(\mathbf{x}_k) \cdots)). \tag{6.35}$$

This expression is a flow map over $m$ steps from $\mathbf{x}_k = \mathbf{x}(t_k)$ to $\mathbf{x}_{k+m} = \mathbf{x}(t_{k+m})$.
This is the unfolding of the recursive graph in Fig. 6.18.
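The unfolding in (6.35) amounts to applying the same one-step map repeatedly. The following is a minimal sketch of that composition; the helper `unroll` and the linear stand-in for a trained network `f_theta` are hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np

def unroll(f, x_k, m):
    """Apply the one-step map f a total of m times, returning x_{k+m}."""
    x = np.asarray(x_k, dtype=float)
    for _ in range(m):
        x = f(x)
    return x

# Hypothetical one-step map standing in for a trained network f_theta:
# a rotation by 0.1 rad combined with a slight contraction.
c, s = np.cos(0.1), np.sin(0.1)
A = 0.99 * np.array([[c, -s], [s, c]])
f_theta = lambda x: A @ x

x0 = np.array([1.0, 0.0])
x_m = unroll(f_theta, x0, m=5)   # five-step flow map applied to x0
```

For this linear stand-in, the m-step map is simply the matrix power of `A`, which makes the shared-weight structure of the unfolded graph explicit.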

It is also often the case that, instead of mapping a sequence in the original
input variable $\mathbf{x}_k$, the latent space is used for building a map of the recurrence
and to the output sequence $\mathbf{y}_k$. If $\mathbf{h}_k = \mathbf{h}(t_k)$ is the latent representation, then
the model becomes

$$\mathbf{h}_{k+1} = \mathbf{f}_{\theta}(\mathbf{h}_k, \mathbf{x}_{k+1}), \tag{6.36}$$


where the neural network is now dependent on the input variable. In either
case, a neural network $\mathbf{f}_{\theta}$ is trained to advance the solution in time by training
on trajectories from time $t_1$ to $t_m$. Thus the big difference between the feedforward
neural networks of the last section and RNNs is that RNNs train on
trajectories from $t_1$ to $t_m$ using the entire sequence of time points, whereas feedforward
NNs train from $t_k$ to $t_{k+1}$ (or $t_k$ to $t_{k+m}$ as in Section 12.6). Training over
trajectories allows for the history (memory) of the solution to shape the neural
network model.
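As an illustration of the latent recurrence (6.36), the sketch below steps a single-layer recurrent cell along a trajectory. The tanh form of the update and the names `W`, `U`, and `b` are assumptions made for this example; the weights here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, m = 3, 8, 40   # state dimension, latent dimension, sequence length

# Hypothetical weights theta = (W, U, b); in practice these are trained.
W = 0.1 * rng.standard_normal((n_hidden, n_hidden))
U = 0.1 * rng.standard_normal((n_hidden, n_in))
b = np.zeros(n_hidden)

def f_theta(h_k, x_next):
    """One recurrent update h_{k+1} = tanh(W h_k + U x_{k+1} + b)."""
    return np.tanh(W @ h_k + U @ x_next + b)

x_seq = rng.standard_normal((m, n_in))   # stand-in for a trajectory x_1, ..., x_m
h = np.zeros(n_hidden)
history = []
for x_next in x_seq:
    h = f_theta(h, x_next)               # the same weights are reused at every step
    history.append(h)
history = np.array(history)
```

Because each latent state depends on the previous one, a loss evaluated at the end of the sequence depends on every time point in the trajectory, which is the sense in which the history shapes the model.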

The history of RNNs begins in the 1980s with the foundational work of 
Rumelhart et al. and Hopfield [337]. RNNs in the form of long short-term 
memory (LSTM) networks became especially transformative in speech recognition
applications since an LSTM, through its filtering architecture, regularizes
RNNs to avoid the vanishing gradient problem that is typically encountered
in training. Other RNN architectures that have been constructed in order
to avoid the vanishing or exploding gradients problem include gated recurrent 
units (GRU) and echo state networks (ESN). Thus LSTM, GRU, and ESN, along 
with their variants, are commonly used with time-series data. 

Training on the Lorenz dynamical system model of the last section is similar
to building the feedforward network already considered. In this case, an
LSTM model is built from portions of the trajectory data. The trained model is
compared against the evolution dynamics in Fig. 6.19. The LSTM generates
behavior that mimics the dynamics of the Lorenz equations using trajectories of 40
data points. A simple RNN can be constructed instead of an LSTM by using the
commented-out SimpleRNN line in place of the LSTM line in the code below.


Code 6.8: [Python] LSTM model for dynamics. 


# Assumes t (time vector) and x_t (trajectories, indexed as
# [trajectory, time, state]) from the Lorenz example of the last section.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, SimpleRNN, Dense
from tensorflow.keras.optimizers import SGD

sequence_size = 40; train_size = 80; test_size = 20

rnn_input = np.zeros((train_size*(len(t)-sequence_size-1),
                      sequence_size, 3))
rnn_output = np.zeros((train_size*(len(t)-sequence_size-1), 3))

for j in range(train_size):
    for k in range(len(t)-sequence_size-1):
        # input: sequence_size consecutive states; output: the state that follows
        rnn_input[j*(len(t)-sequence_size-1) + k,:] = x_t[j,k:k+sequence_size,:]
        rnn_output[j*(len(t)-sequence_size-1) + k,:] = x_t[j,k+sequence_size,:]

model = Sequential()
model.add(LSTM(16, input_shape=(None, 3)))
# model.add(SimpleRNN(16, input_shape=(None, 3)))
model.add(Dense(3))

sgd = SGD(0.01)
model.compile(optimizer=sgd, loss='mean_squared_error')
model.fit(rnn_input, rnn_output, epochs=20)


Figure 6.19: Trajectories of the Lorenz dynamical system (black dotted line) versus
the trajectories learned by an LSTM (solid red line). The fit can be improved
by hyperparameter tuning of the sequence size and training time. The specific
training trajectory length selected here was a sequence of 40 time points.
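The windowing performed by the nested loops of Code 6.8 can be checked on a small array before training. The sketch below is a hypothetical helper, `make_windows`, applied to a synthetic trajectory rather than to the Lorenz data; it follows the same indexing, so each input window of `sequence_size` states is paired with the state that immediately follows it.

```python
import numpy as np

def make_windows(traj, sequence_size):
    """Split one trajectory (time, state) into input windows and next-step targets."""
    n_windows = traj.shape[0] - sequence_size - 1
    inputs = np.stack([traj[k:k + sequence_size] for k in range(n_windows)])
    targets = traj[sequence_size:sequence_size + n_windows]
    return inputs, targets

# Tiny synthetic trajectory: 10 time points of a 3-dimensional state.
traj = np.arange(30, dtype=float).reshape(10, 3)
inputs, targets = make_windows(traj, sequence_size=4)
```

With 10 time points and windows of length 4, this yields 5 training pairs, matching the count `len(t)-sequence_size-1` used in the listing.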
6.8 Autoencoders


Autoencoder neural networks are a flexible and advantageous structure for 
exploiting low-dimensional features in high-dimensional data. They are fea- 
tured here since many scientific and engineering applications leverage low- 
dimensional coordinate systems for building parsimonious models character- 
izing a physical process. The autoencoder generalizes the linear subspace em- 
bedding of SVD/PCA to a nonlinear manifold embedding, often of a lower 
dimension. Specifically, the autoencoder maps the original high-dimensional
input vectors $\mathbf{x}_j \in \mathbb{R}^n$ to a low-dimensional latent variable $\mathbf{z}_j \in \mathbb{R}^r$ and then
back to the high-dimensional space as $\hat{\mathbf{x}}_j$, which is technically the output $\mathbf{y}$. The
goal of the autoencoder is to map the input back to itself, i.e., $\|\hat{\mathbf{x}} - \mathbf{x}\|_2 \approx 0$.
Typically $r < n$ for autoencoding and mathematically


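$$\mathbf{Z} = \varphi(\mathbf{X}), \tag{6.37}$$


where $\mathbf{Z}$ is the latent space data and $\mathbf{X}$ is the input high-dimensional data.
Note that the columns of $\mathbf{Z}$ are $\mathbf{z}_j$ and the columns of $\mathbf{X}$ are $\mathbf{x}_j$. Decoding is
represented as

$$\hat{\mathbf{X}} = \psi(\mathbf{Z}), \tag{6.38}$$


where the neural network weights are optimized so that the output $\hat{\mathbf{X}}$ is as close
as possible to the input,

$$\underset{\theta}{\operatorname{argmin}} \|\mathbf{X} - \hat{\mathbf{X}}\|_2^2 = \underset{\theta}{\operatorname{argmin}} \|\mathbf{X} - \mathbf{f}_{\theta}(\mathbf{X})\|_2^2, \tag{6.39}$$


where $\theta$ are the weights of the autoencoder network $\mathbf{f}_{\theta}(\mathbf{x}) = \psi(\varphi(\mathbf{x}))$. The dimension
of the latent space $r$ is often determined by hyperparameter tuning.
Thus $r$ is made as small as possible until the autoencoder performance starts to
fail. This is often informative, as it can discover the intrinsic dimensionality of
the data.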
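The idea of shrinking r until reconstruction fails can be illustrated with a linear stand-in for the autoencoder. The sketch below uses a truncated SVD as the encode/decode pair, which is an assumption made for simplicity; with a nonlinear autoencoder, each value of r would require retraining the network. The synthetic data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, true_rank = 50, 200, 3

# Synthetic data whose columns lie in a 3-dimensional subspace of R^50.
X = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, m))

U, S, Vt = np.linalg.svd(X, full_matrices=False)

def reconstruction_error(r):
    """Relative error of the rank-r linear encode/decode X_hat = U_r U_r^T X."""
    Ur = U[:, :r]
    X_hat = Ur @ (Ur.T @ X)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

errors = {r: reconstruction_error(r) for r in range(1, 6)}
```

The error stays large for r below 3 and drops to round-off level at r = 3, which identifies the intrinsic dimension of this synthetic data set.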

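From a more mathematically abstract point of view, the autoencoder provides
a mapping, as illustrated in Fig. 6.20, so that

$$\varphi : \mathcal{X} \to \mathcal{Z}, \tag{6.40a}$$
$$\psi : \mathcal{Z} \to \mathcal{X}, \tag{6.40b}$$


where the input $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^n$ and output $\mathbf{z} \in \mathcal{Z} \subset \mathbb{R}^r$ are defined in high- and
low-dimensional spaces, respectively. The resulting neural network optimization
is formulated around the loss function

$$\underset{\varphi, \psi}{\operatorname{argmin}} \|\mathbf{X} - (\psi \circ \varphi)\mathbf{X}\|. \tag{6.41}$$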




[Figure 6.20 schematic: input $\mathbf{x}$, encoder network $\varphi$, latent space $\mathbf{z}$, decoder network $\psi$, reconstruction $\hat{\mathbf{x}}$.]

Figure 6.20: Autoencoder network structure which maps the input state space
$\mathbf{x}$ to a latent space $\mathbf{z}$. For applications considered here, the latent space is where
the dynamical evolution is modeled. The encoder is denoted by $\varphi$ and the decoder
by $\psi$. The autoencoder is trained by minimizing the loss $\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2$ along
with any other regularization that may be applied.


More generally, the autoencoder is often used with a diversity of regularizations
in the optimization process, so that a more general formulation is given
by

$$\underset{\varphi, \psi}{\operatorname{argmin}} \|\mathbf{X} - (\psi \circ \varphi)\mathbf{X}\| + \sum_{j=1}^{P} \lambda_j g_j(\varphi, \psi), \tag{6.42}$$

where $g_j(\cdot)$ represents a regularizer, weighted by $\lambda_j$, and there are a total of $P$
additional loss functions added to the optimization. For instance, an elastic net
penalty can be added where two additional loss functions would penalize the
$\ell_2$- and $\ell_1$-norm of the network weights. This often can help produce better
results in the network. The $\ell_1$-norm, for instance, can help deal with outliers
and corrupt data.
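A regularized loss of the form (6.42) with an elastic net penalty can be sketched as follows. The function name, the default weights `lam1` and `lam2`, and the use of the squared $\ell_2$-norm (the common elastic net convention) are assumptions of this example rather than choices stated in the text.

```python
import numpy as np

def elastic_net_loss(X, X_hat, weights, lam1=1e-3, lam2=1e-3):
    """Reconstruction error plus l1 and squared-l2 penalties on all network weights."""
    recon = np.sum((X - X_hat) ** 2)
    w = np.concatenate([W.ravel() for W in weights])
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return recon + lam1 * l1 + lam2 * l2

# Small hand-checkable example.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.array([[1.0, 2.0], [3.0, 5.0]])
weights = [np.array([[1.0, -2.0]]), np.array([[0.5]])]
loss = elastic_net_loss(X, X_hat, weights, lam1=0.1, lam2=0.01)
```

Here the reconstruction term is 1, the $\ell_1$ term is 3.5, and the squared $\ell_2$ term is 5.25, so each penalty contributes in proportion to its weight.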

To demonstrate the ability of the autoencoder to construct a low-dimensional 
representation of high-dimensional data, the fluid flow around a cylinder is 
considered. The data is generated from snapshots of the numerical simulation 
of the incompressible Navier-Stokes equation: 

$$\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} + \nabla p - \frac{1}{Re} \nabla^2 \mathbf{u} = \mathbf{0}, \tag{6.43}$$

with the incompressibility constraint $\nabla \cdot \mathbf{u} = 0$. Here $\mathbf{u}(x, y, t)$ represents the
2D velocity, and $p(x, y, t)$ the corresponding pressure field. The boundary conditions
dictate a constant flow of $\mathbf{u} = (1, 0)^T$ at $x = -15$, i.e., the entry of the
domain. There is also a constant pressure of $p = 0$ at $x = 25$, i.e., the end of the
domain, and Neumann boundary conditions, i.e., $\partial \mathbf{u} / \partial \mathbf{n} = 0$, on the boundary
of the domain and the cylinder (centered at $(x, y) = (0, 0)$ and of radius unity).
Figure 6.21: First six most dominant modes (top left to bottom right) learned
by the autoencoder for the flow around a cylinder. These modes are the latent
representation of the flow physics.


Figure 6.22: First six most dominant modes (top left to bottom right) learned by 
the autoencoder for the flow around a cylinder using a linear reduction, which 
is equivalent to $\varphi \to \mathbf{U}^*$ and $\psi \to \mathbf{U}$. These modes are the SVD (PCA or POD)
representation of the flow physics. 


The autoencoder is created using the keras package with tensorflow. The
following code creates an encoder/decoder structure, whereby each encoder
layer is half the size of the layer before it, and the decoder mirrors the
encoder. Three layers are constructed on the way to the latent dimension of r = 10.


Code 6.9: [Python] Autoencoder structure. 


import tensorflow as tf

class Autoencoder(tf.keras.Model):

    def __init__(self, latent_dim, input_dim, activation='sigmoid'):
        super(Autoencoder, self).__init__()

        self.latent_dim = latent_dim
        self.input_dim = input_dim
        self.activation = activation

        # encoder: halve the layer width three times, then map to the latent space
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(int(self.input_dim/2),
                                  activation=self.activation),
            tf.keras.layers.Dense(int(self.input_dim/4),
                                  activation=self.activation),
            tf.keras.layers.Dense(int(self.input_dim/8),
                                  activation=self.activation),
            tf.keras.layers.Dense(self.latent_dim,
                                  activation='linear'),
        ])

        # decoder: mirror image of the encoder
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(int(self.input_dim/8),
                                  activation=self.activation),
            tf.keras.layers.Dense(int(self.input_dim/4),
                                  activation=self.activation),
            tf.keras.layers.Dense(int(self.input_dim/2),
                                  activation=self.activation),
            tf.keras.layers.Dense(self.input_dim,
                                  activation='linear'),
        ])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded


To train the autoencoder model, the following code is used: 


Code 6.10: [Python] Autoencoder training. 


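latent_dim = 10 # number of modes
activation = 'elu'
input_dim = x_train.shape[1]  # x_train holds one flattened snapshot per row

optimizer = 'adam'; epochs = 50

A = Autoencoder(latent_dim, input_dim, activation)
A.compile(optimizer=optimizer, loss=tf.keras.losses.MeanSquaredError())
A.fit(x_train, x_train, epochs=epochs)  # minimal call: the target is the input itself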


The encoding generates a low-dimensional representation of the flow field 
in the latent space. The first six modes of this nonlinear encoder are highlighted
in Fig. 6.21. This should be compared to a linear encoding/decoding, which
is achieved using PCA so that the encoder/decoder pair $(\varphi, \psi)$ is given by
the first $r$ modes of the SVD $(\mathbf{U}^*, \mathbf{U})$. The linear modes are shown in Fig. 6.22.
Note that the linear modes alternate between symmetric and antisymmetric 
Figure 6.23: Error of the autoencoder representation. The left panel is the original
state space $\mathbf{u}$ to be encoded. The middle panel is the reconstructed state
space $\hat{\mathbf{u}}$. And finally the right panel is the error between the original state space
and its reconstruction, $\mathbf{u} - \hat{\mathbf{u}}$.


structure whereas the nonlinear encoder produces modes that do not have any
clear symmetry. Thus linear and nonlinear encoding can produce quite different
patterns. It is shown later in Section 13.7 that there are many advantages
to the nonlinear decoder in handling noise. Figure 6.23 compares a snapshot
of the flow $\mathbf{u}$ along with its reconstruction $\hat{\mathbf{u}}$ through the encoder/decoder
network. The reconstruction error $\mathbf{u} - \hat{\mathbf{u}}$ is also shown.


6.9 Generative Adversarial Networks (GANs)


Deep learning has also produced success in the generation of synthetic data that 
is indistinguishable from real data. Generative adversarial networks (GANs) learn
how to produce synthetic data through an adversarial structure whereby two 
neural networks are trained simultaneously. One neural network, the discrim- 
inator, classifies sample data as real or fake. A second neural network, the gen- 
erator, produces synthetic data from a latent representation that is run through 
the discriminator to produce a classification of real or fake. The two neural net- 
works are trained simultaneously so that the generator can produce synthetic 
data, or fake data, that is indistinguishable from real data. 

Goodfellow et al. [291] developed the basic architecture of the GAN as illustrated
in Fig. 6.24. Mathematically, the architecture considers a set of real data
$\mathbf{X} \in \mathbb{R}^{n \times m}$. Each data sample $\mathbf{x}_k \in \mathbb{R}^n$ is used as input to a neural network that
Figure 6.24: Generic GAN architecture that requires the training of two neural
networks. Individual samples of data $\mathbf{x}_k \in \mathbb{R}^n$ are used to train a neural network
$\mathbf{f}_{\theta_1}$ that maps the input data to a classification label (real or fake) $\mathbf{y}_k$.
The generative network $\mathbf{f}_{\theta_2}$ maps a random latent space $\mathbf{z}$ to a model of
the data $\tilde{\mathbf{x}}_k \in \mathbb{R}^n$. The generated data is classified by $\mathbf{f}_{\theta_1}$ as real or fake data
by $\tilde{\mathbf{y}}_k$. Backprop is used to update the neural network weights in both
networks so that the generator network produces a model of the data $\tilde{\mathbf{x}}_k \in \mathbb{R}^n$
that is indistinguishable from real data $\mathbf{x}_k \in \mathbb{R}^n$.


maps it to a classification $\mathbf{y}_k$ of real or fake:

$$\mathbf{y}_k = \mathbf{f}_{\theta_1}(\mathbf{x}_k). \tag{6.44}$$


Note that the output, for instance, could be given by $\mathbf{y}_k = [1\ 0]^T$ (real) and
$\mathbf{y}_k = [0\ 1]^T$ (fake). Without a generative model, this task is trivial, as all the
data is, of course, real, so that the label will always be real. A second network is
trained to generate synthetic data samples $\tilde{\mathbf{x}}_k$. Specifically, the goal is to make
data $\tilde{\mathbf{x}}_k$ indistinguishable from real data $\mathbf{x}_k$. This is done by constructing an
input latent space $\mathbf{z}_k$ for the second neural network that generates the synthetic
data

$$\tilde{\mathbf{x}}_k = \mathbf{f}_{\theta_2}(\mathbf{z}_k). \tag{6.45}$$


The latent space $\mathbf{z}_k$ is typically a random vector. By training the network, the
random vector then produces fake data $\tilde{\mathbf{x}}_k$. The fake data is also then used as
an input to the discrimination network

$$\tilde{\mathbf{y}}_k = \mathbf{f}_{\theta_1}(\tilde{\mathbf{x}}_k). \tag{6.46}$$

The output classification vector labels each synthetic data sample as real or
fake. Initially, these vectors are largely labeled as fakes. However, by training
$\mathbf{f}_{\theta_1}$ and $\mathbf{f}_{\theta_2}$ jointly, the generator network can learn to produce a model that minimizes
the number of generated models $\tilde{\mathbf{x}}_k$ that are labeled as fake. Two loss
functions are simultaneously considered: (i) $\mathcal{L}_{\theta_1}$, which maximizes the probability
of assigning the correct label to both training examples $\mathbf{x}_k$ and samples
from the generator $\tilde{\mathbf{x}}_k$; and (ii) $\mathcal{L}_{\theta_2}$, which minimizes the number of synthetic
data labeled as fakes. Thus the second loss function must compute

$$\tilde{\mathbf{y}}_k = \mathbf{f}_{\theta_1}(\tilde{\mathbf{x}}_k) = \mathbf{f}_{\theta_1}(\mathbf{f}_{\theta_2}(\mathbf{z}_k)) \tag{6.47}$$

in order to produce the labels $\tilde{\mathbf{y}}_k$, which are all desired to be considered real.
A highly successful outcome would mean that the true data labels $\mathbf{y}_k$ and synthetic
labels $\tilde{\mathbf{y}}_k$ are all classified as real labels.
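The two competing objectives can be written down concretely once a specific loss is chosen. The sketch below assumes a binary cross-entropy formulation in which the discriminator outputs a single probability of "real" (rather than the two-component label vector used above) and uses the common non-saturating form of the generator loss; the text does not specify these formulas, so they are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Cross-entropy loss: push D(x) toward 1 on real data and D(x_tilde) toward 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: push D(x_tilde) toward 1 (labeled real)."""
    return -np.mean(np.log(d_fake + eps))

# d_real and d_fake are the discriminator's probabilities of "real".
undecided = np.full(4, 0.5)          # discriminator cannot tell real from fake
L_d = discriminator_loss(undecided, undecided)
L_g = generator_loss(undecided)
L_g_fooled = generator_loss(np.full(4, 0.9))   # generator mostly fools the discriminator
```

The generator loss decreases as more synthetic samples are scored as real, while the discriminator loss decreases as real and synthetic samples are labeled correctly, which is the adversarial tension described above.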

GANs gained significant popularity due to their ability to generate deep fakes 
(images, video, and speech) that are difficult to distinguish from reality. They are among
the more controversial neural network architectures to have been developed
to date. However, they also have advantages when applied to various disciplines
in science and engineering. To highlight one of the uses of GAN for scientific 
purposes, we consider the use of GANs for the super-resolution of turbulent 
flow physics (see Fig. 6.25). In the application of super-resolution, the latent
space no longer consists of random inputs. Rather, the latent space $\mathbf{z}_k$ is
the low-resolution flow field. Thus the generator network is trained to produce
high-resolution flow physics $\tilde{\mathbf{x}}_k$ that is indistinguishable from real data
$\mathbf{x}_k$. The generator ($\mathbf{f}_{\theta_2}$) and discriminator ($\mathbf{f}_{\theta_1}$) neural networks are quite sophisticated,
being composed of convolutional layers (CONV), parametric ReLU
(PReLU), batch normalization (BN), and leaky ReLU (LReLU) layers. Deng et
al. [202] show that the GAN is capable of producing high-dimensional reconstructions
using low-resolution measurements, thus showing the effectiveness
of the method. More broadly, one can imagine using GANs to generate syn- 
thetic data that can be useful in various scientific and engineering applications 
where high-resolution fields are expensive. 


6.10 The Diversity of Neural Networks 


There are a wide variety of NN architectures, with only a few of the most dom- 
inant architectures considered thus far. This chapter (and book) does not at- 
tempt to give a comprehensive assessment of the state of the art in neural net- 
works. Rather, our focus is on illustrating some of the key concepts and en- 
abling mathematical architectures that have led NNs to a dominant position in 
Figure 6.25: Architecture for the super-resolution reconstruction of turbulent
velocity fields. In this case, the latent variable space $\mathbf{z}_k$ of Fig. 6.24 is the low-resolution
(LR) version of the flow field. The network is trained so that the LR
flow field can produce a synthetic version of the high-resolution (HR) flow field
$\tilde{\mathbf{x}}_k$ that is indistinguishable from the real data. The generator ($\mathbf{f}_{\theta_2}$) and discriminator
($\mathbf{f}_{\theta_1}$) neural network architectures are composed of convolutional layers
(CONV), parametric ReLU (PReLU), batch normalization (BN), and leaky
ReLU (LReLU). (From Deng et al. [202].)


modern data science. For a more in-depth review, please see Goodfellow et al.
[290]. However, to conclude this chapter, we would like to highlight some of
the NN architectures that are used in practice for various data science tasks.
This overview is inspired by the neural network zoo as highlighted by Fjodor
van Veen of the Asimov Institute (www.asimovinstitute.org).

The neural network zoo highlights some of the different architectural struc- 
tures around NNs. Some of the networks highlighted are commonly used across 
industry, while others serve niche roles for specific applications. Regardless, 
this demonstrates the tremendous variability and research effort focused on 
NNs as a core data science tool. Figure 6.26 highlights the prototype structures
to be discussed in what follows. Note that the bottom panel has a key to the different
types of nodes in the network, including input cells, output cells, and hidden
cells. Additionally, the hidden layer NN cells can have memory effects, kernel
structures, and/or convolution/pooling. For each NN architecture, a brief
description is given along with the original paper proposing the technique.


Perceptron 


The first mathematical model of NNs by Fukushima was termed the Neocogni- 
tron in 1980 [260]. His model had a single layer with a single output cell called 
the perceptron, which made a categorical decision based on the sign of the output.
Figure 6.2 shows this architecture to classify between dogs and cats. The
perceptron is an algorithm for supervised learning of binary classifiers. 
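A single-output binary classifier of this kind can be sketched in a few lines. The example below uses the classic perceptron update rule on a small, linearly separable toy data set; the data, the function names, and the learning rate are hypothetical and are not taken from the dogs-versus-cats example.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron rule: update weights only on misclassified samples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:      # wrong side of (or on) the boundary
                w += lr * y_i * x_i
                b += lr * y_i
    return w, b

def predict(X, w, b):
    """Categorical decision from the sign of the output."""
    return np.where(X @ w + b >= 0, 1, -1)

# Linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 3.0],
              [-1.0, -2.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_perceptron(X, y)
labels = predict(X, w, b)
```

For separable data such as this, the update rule stops changing the weights once every sample lies on the correct side of the decision boundary.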


Feedforward (FF) 


Feedforward networks connect the input layer to the output layer by forming
connections between the units so that they do not form a cycle. An earlier figure
in this chapter has already shown a version of this architecture where the information simply
propagates from left to right in the network. It is often the workhorse of su- 
pervised learning where the weights are trained so as to best classify a given 
set of data. A feedforward network was used in Figs. 6.5 and 6.17 for training
a classifier for dogs versus cats and for predicting time-steps of the Lorenz
attractor, respectively. An important subclass of feedforward networks is deep 
feedforward (DFF) NNs. DFFs simply put together a larger number of hidden 
layers, typically 7-10 layers, to form the NN. A second important class of FF 
is the radial basis network, which uses radial basis functions as the activation 
units [121]. Like any FF network, radial basis function networks have many 
uses, including function approximation, time-series prediction, classification, 
and control. 
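As a small illustration of function approximation with a radial basis function network, the sketch below fits a sine wave using Gaussian activation units followed by a linear output layer solved by least squares. The number of centers, their placement, and the common width are hypothetical choices made for this example.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Gaussian radial basis activations for 1D inputs x, one column per center."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

# Target function to approximate on [0, 2*pi].
x_train = np.linspace(0.0, 2.0 * np.pi, 100)
y_train = np.sin(x_train)

centers = np.linspace(0.0, 2.0 * np.pi, 12)   # hypothetical choice of centers
width = 0.6                                    # hypothetical common width

Phi = rbf_features(x_train, centers, width)
w_out, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)   # linear output layer
y_fit = Phi @ w_out
max_err = np.max(np.abs(y_fit - y_train))
```

Because the hidden units are fixed once the centers and width are chosen, only the linear output weights need to be fit in this sketch; in a trained network the centers and widths can also be adapted.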


Recurrent Neural Network (RNN) 


Illustrated in Fig. 6.26(a), RNNs are characterized by connections between units
that form a directed graph along a sequence. This allows an RNN to exhibit dy- 
namic temporal behavior for a time sequence [230]. Unlike feedforward neural 
networks, RNNs can use their internal state (memory) to process sequences of 
inputs. The prototypical architecture in Fig. 6.26(a) shows that each cell feeds
back on itself. This self-interaction, which is not part of the FF architecture, al- 
lows for a variety of innovations. Specifically, it allows for time delays and/or 
feedback loops. Such controlled states are referred to as gated states or gated
memory, and are part of two key innovations: long short-term memory (LSTM) 
networks and gated recurrent units (GRU) [177]. LSTM is of particular im- 
portance, as it revolutionized speech recognition, setting a variety of perfor- 
mance records and outperforming traditional models in a variety of speech 
applications. GRUs are a variation of LSTMs that have been demonstrated to
exhibit better performance on smaller data sets.

Figure 6.26: Neural network architectures commonly considered in the literature.
The NNs comprise input nodes, output nodes, and hidden nodes. Additionally,
the nodes can have memory, perform convolution and/or pooling,
and perform a kernel transformation. Each network and its acronym are explained
in the text.
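To make the notion of a gated state concrete, the following sketch implements one update of a gated recurrent unit in one common convention, with biases omitted for brevity. The weight names in the dictionary `p` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, p):
    """One GRU update; p is a dict of weight matrices (hypothetical names)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_cand                   # gated blend of old and new

rng = np.random.default_rng(2)
n_in, n_hidden = 3, 5
# "W*" matrices act on the input, "U*" matrices act on the hidden state.
p = {k: 0.5 * rng.standard_normal((n_hidden, n_in if k[0] == "W" else n_hidden))
     for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

h = np.zeros(n_hidden)
for x in rng.standard_normal((10, n_in)):
    h = gru_step(h, x, p)
```

The update gate `z` controls how much of the previous state is retained, which is the mechanism that lets information persist over many time steps.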


Autoencoder (AE) 


The aim of an autoencoder, represented in Fig. 6.26(b), is to learn a representation
(encoding) for a set of data, typically for the purpose of dimensionality
reduction. For AEs, the input and output cells are matched so that the AE is
essentially constructed to be a nonlinear transform into and out of a new representation,
acting as an approximate identity map on the data. Thus AEs can
be thought of as a generalization of linear dimensionality reduction techniques
such as PCA. AEs can potentially produce nonlinear PCA representations of
the data, or nonlinear manifolds on which the data should be embedded [99].
Since most data lives in nonlinear subspaces, AEs are an important class of NN
for data science, with many innovations and modifications. Three important
modifications of the standard AE are commonly used. The variational autoencoder
(VAE) (shown in Fig. 6.26(c)) is a popular approach to unsupervised
learning of complicated distributions. By making strong assumptions concerning
the distribution of latent variables, it can be trained using standard gradient
descent in order to provide a good assessment of data in an unsupervised fashion.
The de-noising autoencoder (DAE) (shown in Fig. 6.26(c)) takes a partially
corrupted input during training to recover the original undistorted input.
Thus noise is intentionally added to the input in order to learn the nonlinear
embedding. Finally, the sparse autoencoder (SAE) (shown in Fig. 6.26(d))
imposes sparsity on the hidden units during training, while having a larger
number of hidden units than inputs, so that an autoencoder can learn useful
structures in the input data. Sparsity is typically imposed by thresholding all
but the few strongest hidden unit activations.
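Two of these modifications reduce to simple operations on the data and the hidden activations. The sketch below shows a hypothetical input corruption step for a de-noising autoencoder and a top-k thresholding step for a sparse autoencoder; the additive Gaussian noise model and the helper names are assumptions made for illustration.

```python
import numpy as np

def corrupt(x, noise_std, rng):
    """De-noising AE: add Gaussian noise to the input; the target stays clean."""
    return x + noise_std * rng.standard_normal(x.shape)

def keep_top_k(activations, k):
    """Sparse AE: zero all but the k largest-magnitude activations per sample."""
    a = np.asarray(activations, dtype=float)
    idx = np.argsort(np.abs(a), axis=1)[:, :-k]   # indices of the weakest units
    sparse = a.copy()
    np.put_along_axis(sparse, idx, 0.0, axis=1)
    return sparse

rng = np.random.default_rng(3)
x_clean = np.ones((2, 4))
x_noisy = corrupt(x_clean, noise_std=0.1, rng=rng)   # training pair (x_noisy, x_clean)

hidden = np.array([[0.1, -2.0, 0.3, 1.5, -0.2],
                   [0.9, 0.05, -0.7, 0.0, 0.4]])
hidden_sparse = keep_top_k(hidden, k=2)
```

In the de-noising case the network is trained to map `x_noisy` back to `x_clean`, while in the sparse case the thresholding would be applied to the hidden layer during each forward pass.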


Markov Chain (MC) 


A Markov chain is a stochastic model describing a sequence of possible events 
in which the probability of each event depends only on the state attained in 
the previous event. So, although not formally a NN, it shares many common 
features with RNNs. Markov chains are standard even in undergraduate probability
and statistics courses. Figure 6.26(f) shows the basic architecture where
each cell is connected to the other cells by a probability model for a transition. 
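The defining property, that the next state depends only on the current one, is captured by a transition matrix. The sketch below propagates a probability distribution through a hypothetical three-state chain and compares the long-time result with the stationary distribution obtained from the left eigenvector of the transition matrix.

```python
import numpy as np

# Hypothetical 3-state chain; row i holds the probabilities of moving from state i.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

def step(dist, P):
    """Propagate a probability (row) vector one event forward."""
    return dist @ P

dist = np.array([1.0, 0.0, 0.0])   # start in state 0 with certainty
for _ in range(500):
    dist = step(dist, P)           # next distribution depends only on the current one

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
```

For this chain the iterated distribution and the eigenvector computation agree, since the remaining eigenvalues of the transition matrix have magnitude below one.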


Hopfield Network (HN) 


A Hopfield network is a form of RNN which was popularized by John Hopfield
in 1982 for understanding human memory [337]. Figure 6.26(g) shows the
basic architecture of an all-to-all connected network where each node can act as
an input cell. The network serves as a trainable content-addressable associative
memory system with binary threshold nodes. Given an input, it is iterated on 
the network with a guarantee to converge to a local minimum. Sometimes it 
converges to a false pattern, or memory (wrong local minimum), rather than 
the stored pattern (expected local minimum). 
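A minimal sketch of content-addressable recall is given below, assuming Hebbian (outer-product) storage and asynchronous threshold updates, which is one standard way of setting up such a network. The two stored patterns and the function names are hypothetical.

```python
import numpy as np

def store(patterns):
    """Hebbian weights for +/-1 patterns, with no self-connections."""
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, state, max_sweeps=10):
    """Asynchronous threshold updates until no node changes."""
    s = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new = 1 if W[i] @ s >= 0 else -1
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s

p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1])
p2 = np.array([1, -1, 1, -1, 1, -1, 1, -1])
W = store([p1, p2])

probe = p1.copy()
probe[0] = -1                      # corrupt one entry of the first memory
recovered = recall(W, probe)
```

Starting from the corrupted probe, the iteration settles on the first stored pattern. With more stored patterns or heavier corruption, the same iteration can instead settle on a spurious state, as noted above.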


Boltzmann Machine (BM) 


The Boltzmann machine, sometimes called a stochastic Hopfield network with 
hidden units, is a stochastic, generative counterpart of the Hopfield network. 
Boltzmann machines were among the first neural networks capable of learning internal
representations, and are able to represent and (given sufficient time) solve difficult
combinatorial problems [327]. Figure 6.26(h) shows the structure of the
BM. Note that, unlike Markov chains (which have no input units) or Hopfield 
networks (where all cells are inputs), the BM is a hybrid which has a mixture 
of input cells and hidden units. Boltzmann machines are intuitively appealing 
due to their resemblance to the dynamics of simple physical processes. They 
are named after the Boltzmann distribution in statistical mechanics, which is 
used in their sampling function. 


Restricted Boltzmann Machine (RBM) 


Introduced under the name Harmonium by Paul Smolensky in 1986 [666], RBMs 
have been proposed for dimensionality reduction, classification, collaborative 
filtering, feature learning, and topic modeling. They can be trained for either 
supervised or unsupervised tasks. G. Hinton helped bring them to prominence 
by developing fast algorithms for evaluating them [519]. RBMs are a subset of 
BMs where restrictions are imposed on the NN such that nodes in the NN must 
form a bipartite graph (see Fig. 6.26(e)). Thus a pair of nodes from each of the
two groups of units (commonly referred to as the “visible” and “hidden” units, 
respectively) may have a symmetric connection between them; there are no 
connections between nodes within a group. RBMs can be used in deep learn- 
ing networks and deep belief networks by stacking RBMs and optionally fine- 
tuning the resulting deep network with gradient descent and backpropagation. 


Deep Belief Network (DBN) 


DBNs are generative graphical models that are composed of multiple layers of
latent hidden variables, with connections between the layers but not between
units within each layer [73]. Figure 6.26(i) shows the architecture of the DBN.
The training of the DBNs can be done stack by stack from AE or RBM layers.
Thus each of these layers only has to learn to encode the previous network,
which is effectively a greedy training algorithm for finding locally optimal solutions.
Thus DBNs can be viewed as a composition of simple, unsupervised
networks such as RBMs and AEs where each sub-network’s hidden layer serves 
as the visible layer for the next. 


Deep Convolutional Neural Network (DCNN) 


DCNNs are the workhorse of computer vision and have already been considered
in this chapter. They are abstractly represented in Fig. 6.26(j), and in a
more specific fashion in an earlier figure of this chapter. Their impact and
influence on computer vision cannot be overstated. They were originally
developed for document recognition [433].


Deconvolutional Network (DN) 


Deconvolutional networks, shown in Fig. 6.26(k), are essentially a reverse of
DCNNs [770]. The mathematical structure of DNs permits the unsupervised 
construction of hierarchical image representations. These representations can 
be used for both low-level tasks such as de-noising, as well as providing fea- 
tures for object recognition. Each level of the hierarchy groups information 
from the level beneath to form more complex features that exist over a larger 
scale in the image. As with DCNNs, DNs are well suited for computer vision tasks.


Deep Convolutional Inverse Graphics Network (DCIGN) 


The DCIGN is a form of VAE that uses DCNNs for the encoding and decod- 
ing [417]. As with the AE/VAE/SAE structures, the output layer shown in 
Fig. 6.26(l) is constrained to match the input layer. DCIGNs combine the power
of DCNNs with VAEs, which provides a formative mathematical architecture 
for computer vision and image processing. 


Generative Adversarial Network (GAN) 


In an innovative modification of NNs, the GAN architecture of Fig. 6.26(m)
trains two networks simultaneously [291]. The networks, which are often a 
combination of DCNNs and/or FFs, train by one of the networks generating 
content which the other attempts to judge. Specifically, one network generates 
candidates and the other evaluates them. Typically, the generative network 
learns to map from a latent space to a particular data distribution of inter- 
est, while the discriminative network discriminates between instances from the 
true data distribution and candidates produced by the generator. The generative
network's training objective is to increase the error rate of the discriminative
network (i.e., "fool" the discriminator network by producing novel synthesized
instances that appear to have come from the true data distribution).
The GAN architecture has produced interesting results in computer vision for 
producing synthetic data, such as images and movies. 


Liquid State Machine (LSM) 


The LSM shown in Fig. 6.26(n) is a particular kind of spiking neural network
[469]. An LSM consists of a large collection of nodes, each of which receives 
time-varying input from external sources (the inputs) as well as from other 
nodes. Nodes are randomly connected to each other. The recurrent nature of 
the connections turns the time-varying input into a spatio-temporal pattern of 
activations in the network nodes. The spatio-temporal patterns of activation are 
read out by linear discriminant units. This architecture is motivated by spiking 
neurons in the brain, thus helping understand how information processing and 
discrimination might happen using spiking neurons. 


Extreme Learning Machine (ELM) 


With the same underlying architecture of an LSM shown in Fig. 6.26(n), the
ELM is a FF network for classification, regression, clustering, sparse approxi- 
mation, compression, and feature learning with a single layer or multiple layers 
of hidden nodes, where the parameters of hidden nodes (not just the weights 
connecting inputs to hidden nodes) need not be tuned. These hidden nodes can 
be randomly assigned and never updated, or can be inherited from their ancestors
without being changed. In most cases, the output weights of hidden nodes
are learned in a single step, which essentially amounts to learning a
linear model [150].
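The single-step training described here can be sketched with a random, fixed hidden layer and a least-squares solve for the output weights. The target function, the tanh activation, and the number of hidden nodes below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Training data: approximate y = x^2 on [-1, 1].
x = np.linspace(-1.0, 1.0, 200)[:, None]
y = x[:, 0] ** 2

n_hidden = 50
W_in = rng.standard_normal((1, n_hidden))   # random, never updated
b_in = rng.standard_normal(n_hidden)        # random, never updated

H = np.tanh(x @ W_in + b_in)                    # fixed hidden-layer features
beta, *_ = np.linalg.lstsq(H, y, rcond=None)    # output weights in a single step

y_fit = H @ beta
rmse = np.sqrt(np.mean((y_fit - y) ** 2))
```

Since the hidden layer is never updated, fitting `beta` is an ordinary linear regression on the random features, which is why no iterative backpropagation is needed.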


Echo State Network (ESN) 


ESNs are RNNs with a sparsely connected hidden layer (with typically 1% connectivity).
The hidden neurons have memory, and their connectivity and weights
are fixed and randomly assigned (see Fig. 6.26(o)). Thus, like LSMs and ELMs,
they are not fixed into a well-ordered layered structure. The weights of output
neurons can be learned so that the network can generate specific temporal
patterns [347].
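A minimal echo state network can be sketched as a fixed random reservoir followed by a trained linear readout. The example below predicts a sine wave one step ahead. The reservoir size, the 10% connectivity (denser than the typical value quoted above, to keep this small reservoir well behaved), the spectral radius scaling, and the ridge parameter are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n_res, density, rho = 200, 0.1, 0.9   # hypothetical size, sparsity, spectral radius

# Fixed, sparse, random reservoir and input weights (never trained).
W = rng.uniform(-0.5, 0.5, (n_res, n_res)) * (rng.random((n_res, n_res)) < density)
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, n_res)

# Task: one-step-ahead prediction of a sine wave.
u = np.sin(0.1 * np.arange(1001))
states = np.zeros((1000, n_res))
r = np.zeros(n_res)
for k in range(1000):
    r = np.tanh(W @ r + W_in * u[k])    # reservoir state carries memory of past inputs
    states[k] = r

washout = 100                            # discard the initial transient
R, target = states[washout:], u[washout + 1:]

# Only the readout is learned, here by ridge regression.
ridge = 1e-6
W_out = np.linalg.solve(R.T @ R + ridge * np.eye(n_res), R.T @ target)
mse = np.mean((R @ W_out - target) ** 2)
```

Only `W_out` is fit to data; the recurrent and input weights stay at their random initial values throughout.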


Deep Residual Network (DRN) 


DRNs took the deep learning world by storm when Microsoft Research 
released Deep Residual Learning for Image Recognition [317]. These networks 
led to first-place-winning entries in all five main tracks of the ImageNet and 
COCO 2015 competitions, which covered image classification, object detection, 
and semantic segmentation. The robustness of residual networks (ResNets) has 
since been proven by various visual recognition tasks and by non-visual tasks 
involving speech and language. DRNs are very deep FF networks where there 
are extra connections that pass from one layer to a layer two to five layers 
downstream. This then carries input from an earlier stage to a future stage. 
These networks can be 150 layers deep, which is only abstractly represented in 
Fig. 6.26(p). 


Kohonen Network (KN) 


Kohonen networks are also known as self-organizing feature maps [400]. KNs 
use competitive learning to classify data without supervision. Input is 
presented to the KN as in Fig. 6.26(q), after which the network assesses which 
of the neurons most closely match that input. These self-organizing maps differ from 
other NNs, as they apply competitive learning as opposed to error-correction 
learning (such as backpropagation with gradient descent), and in the sense that 
they use a neighborhood function to preserve the topological properties of the 
input space. This makes KNs useful for low-dimensional visualization of high- 
dimensional data. 


Neural Turing Machine (NTM) 


The NTM architecture implements a neural network controller coupled to an 
external memory resource (see Fig. 6.26(r)), which it interacts with through 
attentional mechanisms [294]. The memory interactions are differentiable 
end-to-end, making it possible to optimize them using gradient descent. An 
NTM with an LSTM controller can infer simple algorithms such as copying, 
sorting, and associative recall from input and output examples. 


Homework 


Exercise 6-1. Download the code base for solving (i) a reaction-diffusion system 
of equations, and (ii) the Kuramoto-Sivashinsky (KS) equation. 


(a) Train a NN that can advance the solution from $t$ to $t + \Delta t$ for the KS 
equation. 


(b) Compare the evolution trajectories of your NN against those of the provided 
ODE time-stepper for different initial conditions. 


(c) For the reaction-diffusion system, first project to a low-dimensional 
subspace via the SVD and see how forecasting works in the low-rank 
variables. 


For the Lorenz equations, consider the following. 


(d) Train an NN to advance the solution from $t$ to $t + \Delta t$ for $\rho = 10$, 28, and 40. 
Now see how well your NN works for future state prediction for $\rho = 17$ 
and $\rho = 35$. 


(e) See if you can train your NN to identify (for $\rho = 28$) when a transition 
from one lobe to another is imminent. Determine how far in advance you 
can make this prediction. (Note: You will have to label the transitions in a 
test set in order to do this task.) 


Exercise 6-2. Consider time-series data acquired from power grid loads, specif- 
ically: T. V. Jensen and P. Pinson. Re-Europe, a large-scale data set for modeling 
a highly renewable European electricity system. Scientific Data, 4:170175, 2017. 
Compare the forecasting capabilities of the following neural networks on the 
power grid data: (i) a feedforward neural network; (ii) an LSTM; (iii) an RNN; 
and (iv) an echo state network. Consider the performance of each under 
cross-validation for forecasting ranges of $\Delta t$ into the future and $N\Delta t$ into the future 
(where $N > 1$). 


Exercise 6-3. Download the flow around the cylinder data. Using the first P% 
of the temporal snapshots, forecast the remaining (100 - P)% future state data. 
Do this by training a neural network on the high-dimensional data and using: 
(i) a feedforward neural network; (ii) an LSTM; (iii) an RNN; and (iv) an echo 
state network. Determine the performance of the algorithms as a function of 
decreasing data P. 


Redo the forecasting calculations by training a model in the reduced subspace 
U from the singular value decomposition. Evaluate the forecasting performance 
as a function of the percentage of the training data P and the rank of the 
reduced space r. 


Exercise 6-4. Generate simulation data for the Kuramoto-Sivashinsky (KS) equa- 
tion in three distinct parameter regimes where non-trivial spatio-temporal dy- 
namics occurs. Using a convolutional neural network, map the high-dimensional 
snapshots of the system to a classification of the system into one of the three pa- 
rameter regimes. Evaluate the performance of the classification scheme on test 
data as a function of different convolutional window sizes and stride lengths. 
For the best performance, what is the convolutional window size and what 
spatial length scale is extracted to make the classification decision? 


Chapter 7 


Data-Driven Dynamical Systems 


Dynamical systems provide a mathematical framework to describe the world 
around us, modeling the rich interactions between quantities that co-evolve in 
time. Formally, dynamical systems concern the analysis, prediction, and un- 
derstanding of the behavior of systems of differential equations or iterative 
mappings that describe the evolution of the state of a system. This formula- 
tion is general enough to encompass a staggering range of phenomena, includ- 
ing those observed in classical mechanical systems, electrical circuits, turbulent 
fluids, climate science, finance, ecology, social systems, neuroscience, epidemi- 
ology, and nearly every other system that evolves in time. 

Modern dynamical systems began with the seminal work of Poincaré on 
the chaotic motion of planets. It is rooted in classical mechanics, and may be 
viewed as the culmination of hundreds of years of mathematical modeling, be- 
ginning with Newton and Leibniz. The full history of dynamical systems is 
too rich for these few pages, having captured the interest and attention of the 
greatest minds for centuries, and having been applied to countless fields and 
challenging problems. Dynamical systems provide one of the most complete 
and well-connected fields of mathematics, bridging diverse topics from linear 
algebra and differential equations, to topology, numerical analysis, and geom- 
etry. Dynamical systems have become central in the modeling and analysis of 
systems in nearly every field of the engineering, physical, and life sciences. 

Modern dynamical systems are currently undergoing a renaissance, with 
analytical derivations and first-principles models giving way to data-driven 
approaches. The confluence of big data and machine learning is driving a para- 
digm shift in the analysis and understanding of dynamical systems in science 
and engineering. Data are abundant, while physical laws or governing equa- 
tions remain elusive, as is true for problems in climate science, finance, epi- 
demiology, and neuroscience. Even in classical fields, such as optics and turbu- 
lence, where governing equations do exist, researchers are increasingly turning 
toward data-driven analysis. Many critical data-driven problems, such as 
predicting climate change, understanding cognition from neural recordings, 
predicting and suppressing the spread of disease, or controlling turbulence for 
energy-efficient power production and transportation, are primed to take ad- 
vantage of progress in the data-driven discovery of dynamics. 

In addition, the classical geometric and statistical perspectives on dynam- 
ical systems are being complemented by a third operator-theoretic perspective, 
based on the evolution of measurements of the system. This so-called Koopman 
operator theory is poised to capitalize on the increasing availability of measure- 
ment data from complex systems. Moreover, Koopman theory provides a path 
to identify intrinsic coordinate systems to represent nonlinear dynamics in a 
linear framework. Obtaining linear representations of strongly nonlinear sys- 
tems has the potential to revolutionize our ability to predict and control these 
systems. 

This chapter presents a modern perspective on dynamical systems in the 
context of current goals and open challenges. Data-driven dynamical systems 
is a rapidly evolving field, and therefore we focus on a mix of established and 
emerging methods that are driving current developments. In particular, we will 
focus on the key challenges of discovering dynamics from data and finding 
data-driven representations that make nonlinear systems amenable to linear 
analysis. 


7.1 Overview, Motivations, and Challenges 


Before summarizing recent developments in data-driven dynamical systems, it 
is important to first provide a mathematical introduction to the notation and 
summarize key motivations and open challenges in dynamical systems. 


Dynamical Systems 


Throughout this chapter, we will consider dynamical systems of the form 


$$\frac{d}{dt}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t), t; \boldsymbol{\beta}), \qquad (7.1)$$


where x is the state of the system and f is a vector field that possibly depends 
on the state x, time t, and a set of parameters $\boldsymbol{\beta}$. 
For example, consider the Lorenz equations [460] 


$$\dot{x} = \sigma(y - x), \qquad (7.2a)$$
$$\dot{y} = x(\rho - z) - y, \qquad (7.2b)$$
$$\dot{z} = xy - \beta z, \qquad (7.2c)$$


with parameters $\sigma = 10$, $\rho = 28$, and $\beta = 8/3$. A trajectory of the Lorenz system 
is shown in Fig. 7.1. In this case, the state vector is $\mathbf{x} = [x \;\; y \;\; z]^T$ and the 
parameter vector is $\boldsymbol{\beta} = [\sigma \;\; \rho \;\; \beta]^T$. 

Figure 7.1: Chaotic trajectory of the Lorenz system from (7.2). 


The Lorenz system is among the simplest and most well-studied dynamical 
systems that exhibit chaos, which is characterized as a sensitive dependence on 
initial conditions. Two trajectories with nearby initial conditions will rapidly 
diverge in behavior, and, after long times, only statistical statements can be 
made. 

It is simple to simulate dynamical systems, such as the Lorenz system. First, 
the vector field $\mathbf{f}(\mathbf{x}, t; \boldsymbol{\beta})$ is defined in the function lorenz in Code 7.1. 


Code 7.1: [MATLAB] Define Lorenz vector field. 


function dx = lorenz(t,x,Beta)
% Lorenz vector field; Beta = [sigma; rho; beta]
dx = [
Beta(1)*(x(2)-x(1));
x(1)*(Beta(2)-x(3))-x(2);
x(1)*x(2)-Beta(3)*x(3);
];


Code 7.1: [Python] Define Lorenz vector field. 


def lorenz(x_y_z, t0, sigma=10, beta=8/3, rho=28):
    # Defaults are the chaotic parameter values used in Code 7.2
    x, y, z = x_y_z
    return [sigma*(y-x), x*(rho-z)-y, x*y-beta*z]


In Code 7.2 we define the system parameters $\boldsymbol{\beta}$, initial condition $\mathbf{x}_0$, and 
timespan, and simulate the equations with an adaptive time-step integration 
scheme; in MATLAB we use the ode45 command, a fourth-order Runge-Kutta 
scheme, and in Python we use the integrate.odeint command, which calls the 
adaptive LSODA solver. 

Code 7.2: [MATLAB] Define Lorenz system parameters and simulate with 
Runge-Kutta integrator. 


Beta = [10; 28; 8/3]; % Lorenz's parameters (chaotic)
x0=[0; 1; 20]; % Initial condition
dt = 0.001;

tspan=dt:dt:50;
options = odeset('RelTol',1e-12,'AbsTol',1e-12*ones(1,3));

[t,x]=ode45(@(t,x) lorenz(t,x,Beta),tspan,x0,options);
plot3(x(:,1),x(:,2),x(:,3));


Code 7.2: [Python] Define Lorenz system parameters and simulate with an 
adaptive integrator. 


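import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate

beta = 8/3
sigma = 10
rho = 28

x0 = (0,1,20)  # initial condition
dt = 0.001

t = np.arange(0,50+dt,dt)

x_t = integrate.odeint(lorenz, x0, t, rtol=10**(-12),
                       atol=10**(-12)*np.ones_like(x0))

x, y, z = x_t.T
ax = plt.figure().add_subplot(projection='3d')  # 3D axes for the trajectory
ax.plot(x, y, z, linewidth=1)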
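The sensitive dependence on initial conditions described earlier can be checked numerically. The sketch below is self-contained and makes several assumptions for illustration: it uses SciPy's `solve_ivp` with the explicit Runge-Kutta `RK45` method, a perturbation of size $10^{-8}$ in the third state, and a final time of 40; the helper name `lorenz_rhs` is hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz_rhs(t, x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Lorenz vector field in the (t, x) argument order expected by solve_ivp."""
    return [sigma * (x[1] - x[0]),
            x[0] * (rho - x[2]) - x[1],
            x[0] * x[1] - beta * x[2]]

t_eval = np.linspace(0, 40, 4001)
opts = dict(method="RK45", t_eval=t_eval, rtol=1e-9, atol=1e-9)

# Two trajectories whose initial conditions differ by 1e-8 in the z-component.
sol_a = solve_ivp(lorenz_rhs, (0, 40), [0.0, 1.0, 20.0], **opts)
sol_b = solve_ivp(lorenz_rhs, (0, 40), [0.0, 1.0, 20.0 + 1e-8], **opts)

# Euclidean distance between the two trajectories at each output time.
separation = np.linalg.norm(sol_a.y - sol_b.y, axis=0)
print("initial separation:", separation[0])
print("largest separation:", separation.max())
```

Plotting `separation` on a logarithmic axis shows roughly exponential growth until the distance saturates at the size of the attractor, after which only statistical statements about the two trajectories remain meaningful.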


We will often consider the simpler case of an autonomous system without 
time dependence or parameters: 


$$\frac{d}{dt}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t)). \qquad (7.3)$$

In general, $\mathbf{x}(t) \in M$ is an n-dimensional state that lives on a smooth 
manifold $M$, and $\mathbf{f}$ is an element of the tangent bundle $TM$ of $M$ so that 
$\mathbf{f}(\mathbf{x}(t)) \in TM$. However, we will typically consider the simpler case where 
$\mathbf{x}$ is a vector, $M = \mathbb{R}^n$, and $\mathbf{f}$ is a Lipschitz continuous function, 
guaranteeing existence and uniqueness of solutions to (7.3). For the more 
general formulation, see [4]. 


Discrete-Time Systems 
We will also consider the discrete-time dynamical system 
$$\mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k). \qquad (7.4)$$


Also known as a map, the discrete-time dynamics are more general than the 
continuous-time formulation in (7.3), encompassing discontinuous and hybrid 
systems as well. 

Figure 7.2: Attracting sets of the logistic map for varying parameter $\beta$. 


For example, consider the logistic map: 
$$x_{k+1} = \beta x_k (1 - x_k). \qquad (7.5)$$


As the parameter $\beta$ is increased, the attracting set becomes increasingly 
complex, as shown in Fig. 7.2. A series of period-doubling bifurcations occurs until 
the attracting set becomes fractal. 
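The period doubling can be observed by iterating the logistic map (7.5) directly. The sketch below discards a transient and then counts the distinct values visited by the orbit; the helper name `logistic_attractor`, the initial condition, the iteration counts, and the four sampled values of $\beta$ are illustrative assumptions.

```python
import numpy as np

def logistic_attractor(beta, n_transient=2000, n_keep=64, x0=0.5):
    """Iterate x_{k+1} = beta*x_k*(1-x_k); return the distinct values visited after a transient."""
    x = x0
    for _ in range(n_transient):          # let the orbit settle onto the attracting set
        x = beta * x * (1.0 - x)
    orbit = []
    for _ in range(n_keep):               # record points on the attracting set
        x = beta * x * (1.0 - x)
        orbit.append(x)
    return np.unique(np.round(orbit, 6))  # merge values that agree to six decimals

# Number of distinct attracting values for a few parameter values.
period_counts = {beta: len(logistic_attractor(beta)) for beta in (2.8, 3.2, 3.5, 3.9)}
print(period_counts)
```

For the first three parameter values the orbit settles onto a fixed point, a period-2 cycle, and a period-4 cycle, respectively, while for $\beta = 3.9$ the recorded points do not repeat. Sweeping $\beta$ over a fine grid and plotting the returned values gives a diagram of the kind shown in Fig. 7.2.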

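Discrete-time dynamics may be induced from continuous-time dynamics, 
where $\mathbf{x}_k$ is obtained by sampling the trajectory in (7.3) discretely in time, so 
that $\mathbf{x}_k = \mathbf{x}(k\Delta t)$. The discrete-time propagator $\mathbf{F}_{\Delta t}$ is now parameterized by 
the time-step $\Delta t$. For an arbitrary time $t$, the flow map $\mathbf{F}_t$ is defined as 


$$\mathbf{F}_t(\mathbf{x}(t_0)) = \mathbf{x}(t_0) + \int_{t_0}^{t_0+t} \mathbf{f}(\mathbf{x}(\tau))\,d\tau. \qquad (7.6)$$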


The discrete-time perspective is often more natural when considering experi- 
mental data and digital control. 
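As a hedged illustration of the flow map in (7.6), the sketch below builds the propagator $\mathbf{F}_{\Delta t}$ by numerically integrating a vector field over one time-step, and checks that two steps of size $\Delta t$ agree with one step of size $2\Delta t$. The damped pendulum vector field, the helper name `flow_map`, and the tolerances are assumptions chosen for the example.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x):
    """Damped pendulum vector field (an illustrative choice, not from the text)."""
    return [x[1], -np.sin(x[0]) - 0.1 * x[1]]

def flow_map(x0, dt):
    """Discrete-time propagator F_dt: integrate the ODE from x0 over a time dt."""
    sol = solve_ivp(f, (0.0, dt), x0, rtol=1e-10, atol=1e-12)
    return sol.y[:, -1]                        # state at the final time dt

x0 = np.array([1.0, 0.0])
dt = 0.1
two_steps = flow_map(flow_map(x0, dt), dt)     # F_dt applied twice
one_long_step = flow_map(x0, 2 * dt)           # F_{2 dt} applied once
print(two_steps, one_long_step)
```

For an autonomous system the two results agree up to integration error, which is what allows sampled data $\mathbf{x}_k = \mathbf{x}(k\Delta t)$ to be treated as iterates of a single map.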


Linear Dynamics and Spectral Decomposition 


Whenever possible, it is desirable to work with linear dynamics of the form 


$$\frac{d}{dt}\mathbf{x} = \mathbf{A}\mathbf{x}. \qquad (7.7)$$


Linear dynamical systems admit closed-form solutions, and there are a wealth 
of techniques for the analysis, prediction, numerical simulation, estimation, 
and control of such systems. The solution of (7.7) is given by 


$$\mathbf{x}(t_0 + t) = e^{\mathbf{A}t}\mathbf{x}(t_0). \qquad (7.8)$$

The dynamics are entirely characterized by the eigenvalues and eigenvectors 
of the matrix $\mathbf{A}$, given by the spectral decomposition (eigendecomposition) of $\mathbf{A}$: 


$$\mathbf{A}\mathbf{T} = \mathbf{T}\boldsymbol{\Lambda}. \qquad (7.9)$$


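When $\mathbf{A}$ has $n$ distinct eigenvalues, then $\boldsymbol{\Lambda}$ is a diagonal matrix containing the 
eigenvalues $\lambda_j$ and $\mathbf{T}$ is a matrix whose columns are the linearly independent 
eigenvectors $\boldsymbol{\xi}_j$ associated with eigenvalues $\lambda_j$. In this case, it is possible to 
write $\mathbf{A} = \mathbf{T}\boldsymbol{\Lambda}\mathbf{T}^{-1}$, and the solution in (7.8) becomes 


$$\mathbf{x}(t_0 + t) = \mathbf{T}e^{\boldsymbol{\Lambda}t}\mathbf{T}^{-1}\mathbf{x}(t_0). \qquad (7.10)$$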


More generally, in the case of repeated eigenvalues, the matrix $\boldsymbol{\Lambda}$ will consist 
of Jordan blocks [562]. See Section 8.2 for a detailed derivation of the above 
arguments for control systems. Note that the continuous-time system gives rise 
to a discrete-time dynamical system, with $\mathbf{F}_t$ given by the solution map $\exp(\mathbf{A}t)$ 
in (7.8). In this case, the discrete-time eigenvalues are given by $e^{\lambda t}$. 

The matrix $\mathbf{T}^{-1}$ defines a transformation, $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$, into intrinsic 
eigenvector coordinates, $\mathbf{z}$, where the dynamics become decoupled: 


$$\frac{d}{dt}\mathbf{z} = \boldsymbol{\Lambda}\mathbf{z}. \qquad (7.11)$$

In other words, each coordinate, $z_j$, only depends on itself, with simple 
dynamics given by 


$$\frac{d}{dt}z_j = \lambda_j z_j. \qquad (7.12)$$
Thus, it is highly desirable to work with linear systems, since it is possible to 
easily transform the system into eigenvector coordinates where the dynamics 
become decoupled. No such closed-form solution or simple linear change of 
coordinates exists in general for nonlinear systems, motivating many of the 
directions described in this chapter. 
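The equivalence between the matrix-exponential solution (7.8) and the decoupled solution in eigenvector coordinates (7.10)-(7.12) can be verified numerically. In the sketch below, the $2 \times 2$ matrix, the initial condition, and the time $t$ are arbitrary illustrative choices; the matrix was picked so that its eigenvalues are a distinct complex-conjugate pair.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-0.5, 2.0],
              [-2.0, -0.1]])          # illustrative matrix with distinct eigenvalues
x0 = np.array([1.0, 0.5])
t = 1.5

lam, T = np.linalg.eig(A)             # spectral decomposition: A T = T Lambda
x_expm = expm(A * t) @ x0             # solution via the matrix exponential, as in (7.8)

z0 = np.linalg.solve(T, x0)           # eigenvector coordinates z = T^{-1} x
z_t = np.exp(lam * t) * z0            # each z_j evolves independently, as in (7.12)
x_eig = (T @ z_t).real                # map back to x; imaginary parts cancel

print(x_expm, x_eig)
```

The two solutions agree to machine precision. In the $\mathbf{z}$ coordinates the update is an elementwise multiplication, which is the decoupling that nonlinear systems lack in general.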


Goals and Challenges in Modern Dynamical Systems 


As we generally use dynamical systems to model real-world phenomena, there 
are a number of high-priority goals associated with the analysis of dynamical 
systems: 


(a) Future state prediction. In many cases, such as meteorology and 
climatology, we seek predictions of the future state of a system. Long-time 
predictions may still be challenging. 


(b) Design and optimization. We may seek to tune the parameters of a system 
for improved performance or stability, for example through the placement 
of fins on a rocket. 


(c) Estimation and control. It is often possible to actively control a dynam- 
ical system through feedback, using measurements of the system to in- 
form actuation to modify the behavior. In this case, it is often necessary to 
estimate the full state of the system from limited measurements. 


(d) Interpretability and physical understanding. Perhaps a more fundamen- 
tal goal of dynamical systems is to provide physical insight and inter- 
pretability into a system’s behavior through analyzing trajectories and 
solutions to the governing equations of motion. 


Real-world systems are generally nonlinear and exhibit multi-scale behav- 
ior in both space and time. It must also be assumed that there is uncertainty 
in the equations of motion, in the specification of parameters, and in the mea- 
surements of the system. Some systems are more sensitive to this uncertainty 
than others, and probabilistic approaches must be used. Increasingly, it is also 
the case that the basic equations of motion are not specified and they might be 
intractable to derive from first principles. 

This chapter will cover recent data-driven techniques to identify and ana- 
lyze dynamical systems. The majority of this chapter addresses two primary 
challenges of modern dynamical systems: 


1. Nonlinearity. Nonlinearity remains a primary challenge in analyzing and 
controlling dynamical systems, giving rise to complex global dynamics. 
We saw above that linear systems may be completely characterized in 
terms of the spectral decomposition (i.e., eigenvalues and eigenvectors) 
of the matrix A, leading to general procedures for prediction, estima- 
tion, and control. No such overarching framework exists for nonlinear 
systems, and developing this general framework is a mathematical grand 
challenge of the twenty-first century. 


The leading perspective on nonlinear dynamical systems considers the 
geometry of subspaces of local linearizations around fixed points and pe- 
riodic orbits, global heteroclinic and homoclinic orbits connecting these 
structures, and more general attractors [334]. This geometric theory, orig- 
inating with Poincaré, has transformed how we model complex systems, 
and its success can be largely attributed to theoretical results, such as the 
Hartman-Grobman theorem, which establish when and where it is possible 
to approximate a nonlinear system with linear dynamics. Thus, it is 
often possible to apply the wealth of linear analysis techniques in a small 
neighborhood of a fixed point or periodic orbit. Although the geometric 
perspective provides quantitative locally linear models, global analysis 
has remained largely qualitative and computational, limiting the theory 
of nonlinear prediction, estimation, and control away from fixed points 
and periodic orbits. 


2. Unknown dynamics. Perhaps an even more central challenge arises from 
the lack of known governing equations for many modern systems of in- 
terest. Increasingly, researchers are tackling more complex and realistic 
systems, such as are found in neuroscience, epidemiology, and ecology. 
In these fields, there is a basic lack of known physical laws that provide 
first principles from which it is possible to derive equations of motion. 
Even in systems where we do know the governing equations, such as tur- 
bulence, protein folding, and combustion, we struggle to find patterns 
in these high-dimensional systems to uncover intrinsic coordinates and 
coarse-grained variables along which the dominant behavior evolves. 


Traditionally, physical systems were analyzed by making ideal approxi- 
mations and then deriving simple differential equation models via New- 
ton’s second law. Dramatic simplifications could often be made by ex- 
ploiting symmetries and clever coordinate systems, as highlighted by the 
success of Lagrangian and Hamiltonian dynamics [3, 486]. With increasingly 
complex systems, the paradigm is shifting from this classical approach 
to data-driven methods to discover governing equations. 


All models are approximations, and, with increasing complexity, these 
approximations often become suspect. Determining what is the correct 
model is becoming more subjective, and there is a growing need for au- 
tomated model discovery techniques that illuminate underlying physical 
mechanisms. There are also often latent variables that are relevant to the 
dynamics but may go unmeasured. Uncovering these hidden effects is a 
major challenge for data-driven methods. 


Identifying unknown dynamics from data and learning intrinsic coordi- 
nates that enable the linear representation of nonlinear systems are two of 
the most pressing goals of modern dynamical systems. Overcoming the chal- 
lenges of unknown dynamics and nonlinearity has the promise of transforming 
our understanding of complex systems, with tremendous potential benefit to 
nearly all fields of science and engineering. 

Throughout this chapter we will explore these issues in further detail and 
describe a number of the emerging techniques to address these challenges. In 
particular, there are two key approaches that are defining modern data-driven 
dynamical systems: 


(a) Operator-theoretic representations. To address the issue of nonlinearity, 
operator-theoretic approaches to dynamical systems are increasingly being 
used. As we will show, it is possible to represent nonlinear 
dynamical systems in terms of infinite-dimensional but linear operators, 
such as the Koopman operator from Section 7.4 that advances measurement 
functions, and the Perron-Frobenius operator that advances probability 
densities and ensembles through the dynamics. 


(b) Data-driven regression and machine learning. As data become increasingly 
abundant, and we continue to investigate systems that are not amenable 
to first-principles analysis, regression and machine learning are becoming 
vital tools to discover dynamical systems from data. This is the basis 
of many of the techniques described in this chapter, including the dynamic 
mode decomposition (DMD) in Section 7.2, the sparse identification 
of nonlinear dynamics (SINDy) in Section 7.3, the data-driven Koopman 
methods in Section 7.5, as well as the use of genetic programming to 
identify dynamics from data [95, 640]. 


It is important to note that many of the methods and perspectives described 
in this chapter are interrelated, and continuing to strengthen and uncover these 
relationships is the subject of ongoing research. It is also worth mentioning that 
a third major challenge is the high dimensionality associated with many mod- 
ern dynamical systems, such as are found in population dynamics, brain simu- 
lations, and high-fidelity numerical discretizations of partial differential equa- 
tions. High dimensionality is addressed extensively in the subsequent chapters 
on reduced-order models (ROMs). 

Finally, several open-source software libraries are being developed for data- 
driven dynamical systems, including 


• PyDMD (https://github.com/mathLab/PyDMD); 
• PySINDy (https://github.com/dynamicslab/pysindy); 
• PyKoopman (https://github.com/dynamicslab/pykoopman); 
• Data-driven dynamical systems toolbox (https://github.com/sklus/d3s); 
• Deeptime (https://github.com/deeptime-ml/deeptime). 


7.2 Dynamic Mode Decomposition (DMD) 


Dynamic mode decomposition was developed by Schmid in the fluid 
dynamics community to identify spatio-temporal coherent structures from 
high-dimensional data. DMD is based on proper orthogonal decomposition (POD), 
which utilizes the computationally efficient singular value decomposition (SVD), 
so that it scales well to provide effective dimensionality reduction in 
high-dimensional systems. In contrast to SVD/POD, which results in a hierarchy of 
modes based entirely on spatial correlation and energy content, while largely 
ignoring temporal information, DMD provides a modal decomposition where 
each mode consists of spatially correlated structures that have the same linear 
behavior in time (e.g., oscillations at a given frequency with growth or decay). 
Thus, DMD provides not only dimensionality reduction in terms of a reduced 
set of modes, but also a model for how these modes evolve in time. 

Soon after the development of the original DMD algorithm [635, 636], 
Rowley, Mezić, and collaborators established an important connection between DMD 
and Koopman theory (see Section 7.4). DMD may be formulated as an 
algorithm to identify the best-fit linear dynamical system that advances 
high-dimensional measurements forward in time [727]. In this way, DMD 
approximates the Koopman operator restricted to the set of direct measurements of 
the state of a high-dimensional system. This connection between the computa- 
tionally straightforward and linear DMD framework and nonlinear dynamical 
systems has generated considerable interest in these methods [422]. 

Within a short amount of time, DMD has become a workhorse algorithm for 
the data-driven characterization of high-dimensional systems. DMD is equally 
valid for experimental and numerical data, as it is not based on knowledge 
of the governing equations, but is instead based purely on measurement data. 
The DMD algorithm may also be seen as connecting the favorable aspects of 
the SVD (see Chapter 1) for spatial dimensionality reduction and the FFT (see 
Chapter 2) for temporal frequency identification [422]. Thus, each DMD 
mode is associated with a particular eigenvalue $\lambda = a + ib$, with a particular 
frequency of oscillation $b$ and growth or decay rate $a$. 

There are many variants of DMD and it is connected to existing techniques 
from system identification and modal extraction. DMD has become especially 
popular in recent years, in large part due to its simple numerical implemen- 
tation and strong connections to nonlinear dynamical systems via Koopman 
spectral theory. Finally, DMD is an extremely flexible platform, both mathemat- 
ically and numerically, facilitating innovations related to compressed sensing, 
control theory, and multi-resolution techniques. These connections and exten- 
sions will be discussed at the end of this section. 


The DMD Algorithm 


Several algorithms have been proposed for DMD, although here we present 
the exact DMD framework developed by Tu et al. [727]. Whereas earlier formu- 
lations required uniform sampling of the dynamics in time, the approach pre- 
sented here works with irregularly sampled data and with concatenated data 
from several different experiments or numerical simulations. Moreover, the 
exact formulation of Tu et al. provides a precise mathematical definition of DMD 
that allows for rigorous theoretical results. Finally, exact DMD is based on the 
efficient and numerically well-conditioned singular value decomposition, as is 
the original formulation by Schmid [635]. 

DMD is inherently data-driven, and the first step is to collect a number of 
pairs of snapshots of the state of a system as it evolves in time. These 
snapshot pairs may be denoted by $\{(\mathbf{x}(t_k), \mathbf{x}(t_k'))\}_{k=1}^{m}$, where $t_k' = t_k + \Delta t$, and the 
time-step $\Delta t$ is sufficiently small to resolve the highest frequencies in the 
dynamics. As before, a snapshot may be the state of a system, such as a 
three-dimensional fluid velocity field sampled at a number of discretized locations, 
which is reshaped into a high-dimensional column vector. These snapshots are 
then arranged into two data matrices, $\mathbf{X}$ and $\mathbf{X}'$: 


$$\mathbf{X} = \begin{bmatrix} \mathbf{x}(t_1) & \mathbf{x}(t_2) & \cdots & \mathbf{x}(t_m) \end{bmatrix}, \qquad (7.13a)$$

$$\mathbf{X}' = \begin{bmatrix} \mathbf{x}(t_1') & \mathbf{x}(t_2') & \cdots & \mathbf{x}(t_m') \end{bmatrix}. \qquad (7.13b)$$


The original formulations of Schmid and Rowley et al. assumed 
uniform sampling in time, so that $t_k = k\Delta t$ and $t_k' = t_k + \Delta t = t_{k+1}$. If we 
assume uniform sampling in time, we will adopt the notation $\mathbf{x}_k = \mathbf{x}(k\Delta t)$. 

The DMD algorithm seeks the leading spectral decomposition (i.e., eigen- 
values and eigenvectors) of the best-fit linear operator A that relates the two 
snapshot matrices in time: 


$$\mathbf{X}' \approx \mathbf{A}\mathbf{X}. \qquad (7.14)$$


The best-fit operator A then establishes a linear dynamical system that best 
advances snapshot measurements forward in time. If we assume uniform sam- 
pling in time, this becomes 


$$\mathbf{x}_{k+1} \approx \mathbf{A}\mathbf{x}_k. \qquad (7.15)$$
Mathematically, the best-fit operator A is defined as 


$$\mathbf{A} = \underset{\mathbf{A}}{\operatorname{argmin}} \|\mathbf{X}' - \mathbf{A}\mathbf{X}\|_F = \mathbf{X}'\mathbf{X}^\dagger, \qquad (7.16)$$


where $\|\cdot\|_F$ is the Frobenius norm and $\dagger$ denotes the pseudo-inverse. The 
optimized DMD algorithm generalizes the optimization framework of exact DMD 
to perform a regression to exponential-time dynamics, thus providing an 
improved computation of the DMD modes and their eigenvalues [27]. 

It is worth noting at this point that the best-fit matrix $\mathbf{A}$ closely resembles 
the Koopman operator in Section 7.4 (see equation (7.63)), if we choose direct 
linear measurements of the state, so that $\mathbf{g}(\mathbf{x}) = \mathbf{x}$. This connection was 
originally established by Rowley, Mezić, and collaborators [611], and has sparked 
considerable interest in both DMD and Koopman theory. These connections 
will be explored in more depth below. Because $\mathbf{A}$ is an approximate 
representation of the Koopman operator restricted to a finite-dimensional subspace of 
linear measurements, we are often interested in the eigenvectors $\boldsymbol{\Phi}$ and 
eigenvalues $\boldsymbol{\Lambda}$ of $\mathbf{A}$: 


$$\mathbf{A}\boldsymbol{\Phi} = \boldsymbol{\Phi}\boldsymbol{\Lambda}. \qquad (7.17)$$
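For a low-dimensional state, the best-fit operator (7.16) and its spectral decomposition (7.17) can be computed directly. The sketch below is a toy check under stated assumptions: the matrix `A_true`, the number of snapshot pairs, and the use of random (non-sequential) snapshots are all illustrative choices, and the data are noise-free.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, -0.2],
                   [0.2,  0.9]])          # illustrative linear map (assumption)

m = 20
X = rng.standard_normal((2, m))           # snapshots; they need not be sequential in time
Xprime = A_true @ X                       # each column advanced one step by A_true

A_fit = Xprime @ np.linalg.pinv(X)        # best-fit operator A = X' X^dagger, as in (7.16)
Lam, Phi = np.linalg.eig(A_fit)           # eigenvalues and eigenvectors, as in (7.17)

print(A_fit)
print(Lam)
```

Because the data are generated exactly by a linear map and $\mathbf{X}$ has full row rank, the regression recovers `A_true`. For a high-dimensional state this direct computation becomes impractical because $\mathbf{A}$ has $n^2$ entries.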


However, for a high-dimensional state vector $\mathbf{x} \in \mathbb{R}^n$, the matrix $\mathbf{A}$ has $n^2$ 
elements, and representing this operator, let alone computing its spectral 
decomposition, may be intractable. Instead, the DMD algorithm leverages 
dimensionality reduction to compute the dominant eigenvalues and eigenvectors of $\mathbf{A}$ 
without requiring any explicit computations using $\mathbf{A}$ directly. In particular, the 
pseudo-inverse $\mathbf{X}^\dagger$ in (7.16) is computed via the singular value decomposition 
of the matrix $\mathbf{X}$. Since this matrix typically has far fewer columns than rows, 
i.e., $m \ll n$, there are at most $m$ non-zero singular values and corresponding 
singular vectors, and hence the matrix $\mathbf{A}$ will have at most rank $m$. Instead of 
computing $\mathbf{A}$ directly, we compute the projection of $\mathbf{A}$ onto these leading 
singular vectors, resulting in a small matrix $\tilde{\mathbf{A}}$ of size at most $m \times m$. A major 
contribution of Schmid was a procedure to approximate the high-dimensional 
DMD modes (eigenvectors of $\mathbf{A}$) from the reduced matrix $\tilde{\mathbf{A}}$ and the data 
matrix $\mathbf{X}$ without ever resorting to computations on the full $\mathbf{A}$. Tu et al. later 
proved that these approximate modes are in fact exact eigenvectors of the full 
$\mathbf{A}$ matrix under certain conditions. Thus, the exact DMD algorithm of Tu et al. 
is given by the following steps: 


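Step 1. Compute the singular value decomposition of $\mathbf{X}$ (see Chapter 1): 

$$\mathbf{X} \approx \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*, \qquad (7.18)$$


where $\tilde{\mathbf{U}} \in \mathbb{C}^{n\times r}$, $\tilde{\boldsymbol{\Sigma}} \in \mathbb{C}^{r\times r}$, and $\tilde{\mathbf{V}} \in \mathbb{C}^{m\times r}$, and $r \le m$ denotes either the 
exact or the approximate rank of the data matrix $\mathbf{X}$. In practice, choosing 
the approximate rank $r$ is one of the most important and subjective steps 
in DMD, and in dimensionality reduction in general. We advocate the 
principled hard-thresholding algorithm of Gavish and Donoho to 
determine $r$ from noisy data (see Section 1.7). The columns of the matrix 
$\tilde{\mathbf{U}}$ are also known as POD modes, and they satisfy $\tilde{\mathbf{U}}^*\tilde{\mathbf{U}} = \mathbf{I}$. Similarly, the 
columns of $\tilde{\mathbf{V}}$ are orthonormal and satisfy $\tilde{\mathbf{V}}^*\tilde{\mathbf{V}} = \mathbf{I}$. 

Step 2. According to (7.16), the full matrix $\mathbf{A}$ may be obtained by computing 
the pseudo-inverse of $\mathbf{X}$: 


$$\mathbf{A} = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^*. \qquad (7.19)$$


However, we are only interested in the leading $r$ eigenvalues and 
eigenvectors of $\mathbf{A}$, and we may thus project $\mathbf{A}$ onto the POD modes in $\tilde{\mathbf{U}}$: 


$$\tilde{\mathbf{A}} = \tilde{\mathbf{U}}^*\mathbf{A}\tilde{\mathbf{U}} = \tilde{\mathbf{U}}^*\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}. \qquad (7.20)$$


The key observation here is that the reduced matrix $\tilde{\mathbf{A}}$ has the same 
non-zero eigenvalues as the full matrix $\mathbf{A}$. Thus, we need only compute the 
reduced $\tilde{\mathbf{A}}$ directly, without ever working with the high-dimensional $\mathbf{A}$ 
matrix. The reduced-order matrix $\tilde{\mathbf{A}}$ defines a linear model for the 
dynamics of the vector of POD coefficients $\tilde{\mathbf{x}}$: 


$$\tilde{\mathbf{x}}_{k+1} = \tilde{\mathbf{A}}\tilde{\mathbf{x}}_k. \qquad (7.21)$$


Note that the matrix $\tilde{\mathbf{U}}$ provides a map to reconstruct the full state $\mathbf{x}$ from 
the reduced state $\tilde{\mathbf{x}}$, i.e., $\mathbf{x} = \tilde{\mathbf{U}}\tilde{\mathbf{x}}$. 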


Step 3. The spectral decomposition of $\tilde{A}$ is computed:

$$\tilde{A} W = W \Lambda. \tag{7.22}$$

The entries of the diagonal matrix $\Lambda$ are the DMD eigenvalues, which
also correspond to eigenvalues of the full A matrix. The columns of W
are eigenvectors of $\tilde{A}$, and provide a coordinate transformation that
diagonalizes the matrix. These columns may be thought of as linear
combinations of POD mode amplitudes that behave linearly with a single
temporal pattern given by $\lambda$.


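Step 4. The high-dimensional DMD modes $\Phi$ are reconstructed using the
eigenvectors W of the reduced system and the time-shifted snapshot matrix
X' according to

$$\Phi = X' V \Sigma^{-1} W. \tag{7.23}$$

Remarkably, these DMD modes are eigenvectors of the high-dimensional
A matrix corresponding to the eigenvalues in $\Lambda$, as shown in Tu et al.
[727]:

$$\begin{aligned}
A\Phi &= (X' V \Sigma^{-1} U^{*})(X' V \Sigma^{-1} W) \\
      &= X' V \Sigma^{-1} \underbrace{U^{*} X' V \Sigma^{-1}}_{\tilde{A}} W \\
      &= X' V \Sigma^{-1} \tilde{A} W \\
      &= X' V \Sigma^{-1} W \Lambda \\
      &= \Phi \Lambda.
\end{aligned}$$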
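Steps 1 to 4 can be checked numerically on a small problem. The following is a minimal sketch, not the book's reference code: the helper name `exact_dmd` and the synthetic rank-3 test system are assumptions made for illustration, and the full matrix A from (7.19) is formed only because n is tiny here.

```python
import numpy as np

def exact_dmd(X, Xp, r):
    """Steps 1-4 of exact DMD for snapshot matrices X and X' (here Xp)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)      # Step 1, (7.18)
    Ur, Vr = U[:, :r], Vh[:r, :].conj().T
    Sinv = np.diag(1.0 / s[:r])
    Atilde = Ur.conj().T @ Xp @ Vr @ Sinv                 # Step 2, (7.20)
    lam, W = np.linalg.eig(Atilde)                        # Step 3, (7.22)
    Phi = Xp @ Vr @ Sinv @ W                              # Step 4, (7.23)
    A_full = Xp @ Vr @ Sinv @ Ur.conj().T                 # (7.19), small n only
    return Phi, lam, Atilde, Ur, A_full

# Hypothetical test data: a rank-3 linear map on R^30 with known eigenvalues.
rng = np.random.default_rng(0)
n, m, r = 30, 10, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
A_true = Q @ np.diag([0.9, 0.7, -0.5]) @ Q.T
snaps = [Q @ np.ones(r)]
for _ in range(m):
    snaps.append(A_true @ snaps[-1])
S = np.column_stack(snaps)
X, Xp = S[:, :-1], S[:, 1:]

Phi, lam, Atilde, Ur, A_full = exact_dmd(X, Xp, r)

# Reduced dynamics (7.21): POD coefficients advance under Atilde.
reduced_err = np.linalg.norm(Ur.conj().T @ Xp - Atilde @ (Ur.conj().T @ X))
# Exact modes are eigenvectors of the full A: A Phi = Phi Lambda.
mode_err = np.linalg.norm(A_full @ Phi - Phi @ np.diag(lam))
```

Both residuals should be at round-off level for this noise-free data, and the entries of `lam` should match the three eigenvalues used to build the test system.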
In the original paper by Schmid [635], DMD modes are computed using
$\Phi = UW$, which are known as projected modes; however, these modes are not
guaranteed to be exact eigenvectors of A. Because A is defined as
$A = X'X^{\dagger}$, eigenvectors of A should be in the column space of X', as
in the exact DMD definition, instead of the column space of X in the original
DMD algorithm. In practice, the column spaces of X and X' will tend to be
nearly identical for dynamical systems with low-rank structure, so that the
projected and exact DMD modes often converge.
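One way to see how the two mode definitions are related is to combine (7.20), (7.22), and (7.23). The short derivation below is a sketch; the symbol $\Phi_{\text{proj}}$ for the projected modes is notation introduced only for this note.

```latex
\begin{aligned}
\Phi_{\text{proj}} \Lambda
  = U W \Lambda
  = U \tilde{A} W
  = U U^{*} X' V \Sigma^{-1} W
  = U U^{*} \Phi .
\end{aligned}
```

Since $UU^{*}$ is the orthogonal projector onto the column space of U, each projected mode is, up to scaling by its eigenvalue, the projection of the corresponding exact mode onto the span of the POD modes. When the column spaces of X and X' coincide, the projection leaves $\Phi$ unchanged and the two sets of modes differ only by this scaling.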

To find a DMD mode corresponding to a zero eigenvalue, $\lambda = 0$, it is
possible to use the exact formulation if $\phi = X' V \Sigma^{-1} w \neq 0$.
However, if this expression is null, then the projected mode $\phi = Uw$ should
be used.


Historical Perspective 


In the original formulation, the snapshot matrices X and X’ were formed with 
a collection of sequential snapshots, evenly spaced in time: 


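$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix}, \tag{7.24a}$$

$$X' = \begin{bmatrix} x_2 & x_3 & \cdots & x_{m+1} \end{bmatrix}. \tag{7.24b}$$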


Thus, the matrix X can be written in terms of iterations of the matrix A as 


$$X \approx \begin{bmatrix} x_1 & A x_1 & \cdots & A^{m-1} x_1 \end{bmatrix}. \tag{7.25}$$


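Thus, the columns of the matrix X belong to a Krylov subspace generated by
the propagator A and the initial condition $x_1$. In addition, the matrix X' may
be related to X through the shift operator as

$$X' = XS, \tag{7.26}$$

where S is defined as

$$S = \begin{bmatrix}
0 & 0 & 0 & \cdots & 0 & a_1 \\
1 & 0 & 0 & \cdots & 0 & a_2 \\
0 & 1 & 0 & \cdots & 0 & a_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & a_m
\end{bmatrix}. \tag{7.27}$$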
Thus, the first m − 1 columns of X' are obtained directly by shifting the
corresponding columns of X, and the last column is obtained as a best-fit
combination of the m columns of X that minimizes the residual. In this way, the
DMD algorithm resembles an Arnoldi algorithm used to find the dominant
eigenvalues and eigenvectors of a matrix A through iteration. The matrix S will
share eigenvalues with the high-dimensional A matrix, so that decomposition of
S may be used to obtain dynamic modes and eigenvalues. However, computations
based on S are not as numerically stable as the exact algorithm above.
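The companion-matrix construction can be illustrated on a tiny example. This is a sketch under simplifying assumptions: the state dimension equals the number of snapshots (so X is square and invertible), the test matrix `A` is hypothetical, and the last column of S is obtained by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
# Hypothetical propagator with known, well-separated real eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = Q @ np.diag([0.9, 0.6, -0.4, 0.2]) @ Q.T
x1 = Q @ np.ones(m)

# Krylov sequence x1, A x1, ..., A^m x1, as in (7.25).
cols = [x1]
for _ in range(m):
    cols.append(A @ cols[-1])
K = np.column_stack(cols)
X, Xp = K[:, :-1], K[:, 1:]

# Best-fit coefficients a so that X a approximates the last column of X'.
a, *_ = np.linalg.lstsq(X, Xp[:, -1], rcond=None)

# Companion (shift) matrix S from (7.27): ones on the subdiagonal, a in the last column.
S = np.zeros((m, m))
S[1:, :-1] = np.eye(m - 1)
S[:, -1] = a

eig_S = np.sort(np.linalg.eigvals(S).real)
eig_A = np.sort(np.linalg.eigvals(A).real)
```

In this full-rank setting X' = XS holds exactly and S is similar to A, so the two sets of eigenvalues agree. For longer snapshot sequences the Krylov matrix X becomes badly conditioned, which is one way to see why the SVD-based algorithm is preferred.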


Spectral Decomposition and DMD Expansion 


One of the most important aspects of the DMD is the ability to expand the 
system state in terms of a data-driven spectral decomposition: 


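$$x_k = \sum_{j=1}^{r} \phi_j \lambda_j^{k-1} b_j = \Phi \Lambda^{k-1} b
= \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_r \\ | & & | \end{bmatrix}
\begin{bmatrix} \lambda_1^{k-1} & & \\ & \ddots & \\ & & \lambda_r^{k-1} \end{bmatrix}
\begin{bmatrix} b_1 \\ \vdots \\ b_r \end{bmatrix}, \tag{7.28}$$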


where $\phi_j$ are DMD modes (eigenvectors of the A matrix), $\lambda_j$ are DMD
eigenvalues (eigenvalues of the A matrix), and $b_j$ is the mode amplitude. The
DMD expansion above has a direct connection to the Koopman mode decomposition
in Section 7.4 (see equation (7.79)). The DMD expansion may be written
equivalently as


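$$x_k = \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_r \\ | & & | \end{bmatrix}
\begin{bmatrix} b_1 & & \\ & \ddots & \\ & & b_r \end{bmatrix}
\begin{bmatrix} \lambda_1^{k-1} \\ \vdots \\ \lambda_r^{k-1} \end{bmatrix}, \tag{7.29}$$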


which makes it possible to express the data matrix X as 


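$$X = \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_r \\ | & & | \end{bmatrix}
\begin{bmatrix} b_1 & & \\ & \ddots & \\ & & b_r \end{bmatrix}
\begin{bmatrix} 1 & \lambda_1 & \cdots & \lambda_1^{m-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \lambda_r & \cdots & \lambda_r^{m-1} \end{bmatrix}. \tag{7.30}$$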


The vector b of mode amplitudes is generally computed as

$$b = \Phi^{\dagger} x_1, \tag{7.31}$$

using the first snapshot to determine the mixture of DMD mode amplitudes.
However, computing the mode amplitudes is generally quite expensive, even
using the straightforward definition in (7.31). Instead, it is possible to
compute these amplitudes using POD projected data:

$$\begin{aligned}
x_1 &= \Phi b && \text{(7.32a)} \\
\Longrightarrow \quad U \tilde{x}_1 &= X' V \Sigma^{-1} W b && \text{(7.32b)} \\
\Longrightarrow \quad \tilde{x}_1 &= U^{*} X' V \Sigma^{-1} W b && \text{(7.32c)} \\
\Longrightarrow \quad \tilde{x}_1 &= \tilde{A} W b && \text{(7.32d)} \\
\Longrightarrow \quad \tilde{x}_1 &= W \Lambda b && \text{(7.32e)} \\
\Longrightarrow \quad b &= (W \Lambda)^{-1} \tilde{x}_1. && \text{(7.32f)}
\end{aligned}$$

The matrices W and $\Lambda$ are both size r × r, as opposed to the large $\Phi$
matrix, which is n × r. Alternative approaches to compute b will be
discussed in the next subsection.
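The two routes to the amplitudes, (7.31) and (7.32f), can be compared on synthetic data. The sketch below assumes noise-free snapshots generated by two hypothetical modes (`Phi_true`, `lam_true`, `b_true` are illustrative names), so that the first snapshot lies exactly in the span of the POD modes.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 50, 20, 2

# Synthetic snapshots x_k = sum_j phi_j * lam_j^(k-1) * b_j, for k = 1..m+1.
Phi_true = rng.standard_normal((n, r))
lam_true = np.array([0.95, -0.8])
b_true = np.array([1.5, -0.7])
powers = np.arange(m + 1)
S = Phi_true @ np.diag(b_true) @ (lam_true[:, None] ** powers[None, :])
X, Xp = S[:, :-1], S[:, 1:]

# Exact DMD (Steps 1-4).
U, s, Vh = np.linalg.svd(X, full_matrices=False)
Ur, Vr, Sinv = U[:, :r], Vh[:r, :].conj().T, np.diag(1.0 / s[:r])
Atilde = Ur.conj().T @ Xp @ Vr @ Sinv
lam, W = np.linalg.eig(Atilde)
Phi = Xp @ Vr @ Sinv @ W

x1 = X[:, 0]
b_pinv = np.linalg.pinv(Phi) @ x1                                  # (7.31), n x r problem
b_reduced = np.linalg.solve(W @ np.diag(lam), Ur.conj().T @ x1)    # (7.32f), r x r problem

# Rebuild the data from modes, amplitudes, and powers 0..m-1 of the eigenvalues.
vand = lam[:, None] ** np.arange(m)[None, :]
X_rec = Phi @ np.diag(b_reduced) @ vand
```

The reduced solve only touches r × r matrices, which is the point of (7.32). With noisy or rank-truncated data the two amplitude vectors need not coincide exactly.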

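The spectral expansion above may also be written in continuous time by
introducing the continuous eigenvalues $\omega = \log(\lambda)/\Delta t$:

$$x(t) = \sum_{j=1}^{r} \phi_j e^{\omega_j t} b_j = \Phi \exp(\Omega t) b, \tag{7.33}$$

where $\Omega$ is a diagonal matrix containing the continuous-time eigenvalues
$\omega_j$. Thus, the data matrix X may be represented as

$$X \approx \begin{bmatrix} | & & | \\ \phi_1 & \cdots & \phi_r \\ | & & | \end{bmatrix}
\begin{bmatrix} b_1 & & \\ & \ddots & \\ & & b_r \end{bmatrix}
\begin{bmatrix} e^{\omega_1 t_1} & \cdots & e^{\omega_1 t_m} \\ \vdots & \ddots & \vdots \\ e^{\omega_r t_1} & \cdots & e^{\omega_r t_m} \end{bmatrix}
= \Phi \, \mathrm{diag}(b) \, T(\omega). \tag{7.34}$$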
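As an illustration of the continuous-time form, the sketch below fits DMD to a damped oscillation and rebuilds the data as Φ diag(b) T(ω). The test signal (decay rate 0.1, frequency 2), the spatial profiles `u` and `v`, and the choice t_1 = 0 are assumptions made for this example.

```python
import numpy as np

dt, m, n, r = 0.1, 60, 30, 2
t = dt * np.arange(m + 1)
rng = np.random.default_rng(5)
u, v = rng.standard_normal(n), rng.standard_normal(n)

# Damped oscillation with continuous-time eigenvalues -0.1 +/- 2i.
S = np.exp(-0.1 * t) * (np.outer(u, np.cos(2 * t)) + np.outer(v, np.sin(2 * t)))
X, Xp = S[:, :-1], S[:, 1:]

U, s, Vh = np.linalg.svd(X, full_matrices=False)
Ur, Vr, Sinv = U[:, :r], Vh[:r, :].conj().T, np.diag(1.0 / s[:r])
Atilde = Ur.conj().T @ Xp @ Vr @ Sinv
lam, W = np.linalg.eig(Atilde)
Phi = Xp @ Vr @ Sinv @ W
b = np.linalg.solve(W @ np.diag(lam), Ur.conj().T @ X[:, 0])

omega = np.log(lam) / dt                  # continuous-time eigenvalues
T = np.exp(np.outer(omega, t[:m]))        # T(omega): entry (j, k) is exp(omega_j * t_k)
X_rec = Phi @ np.diag(b) @ T              # right-hand side of (7.34)
```

The complex logarithm is taken on its principal branch, so this recovers the frequency only when the sampling is fast enough that |Im(ω)| Δt < π.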


Alternative Optimizations to De-Noise and Robustify DMD 


The DMD algorithm is purely data-driven, and is thus equally applicable to 
experimental and numerical data. When characterizing experimental data with 
DMD, the effects of sensor noise and stochastic disturbances must be accounted 
for. Bagheri showed that DMD is particularly sensitive to the effects of
noisy data, and it has been shown that significant and systematic biases are
introduced to the eigenvalue distribution [195, 221, 321]. Although increased
sampling decreases the variance of the eigenvalue distribution, it does not
remove the bias [321]. This noise sensitivity has motivated several alternative
optimization algorithms for DMD to improve the quality and performance of
DMD over the standard optimization in (7.16), which is a least-squares fitting
procedure involving the Frobenius norm. These algorithms include the total
least-squares DMD [321], forward-backward DMD [195], variable projection
[27], and robust principal component analysis [631].
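Before any of these noise-aware variants are applied, the truncation rank r itself has to be chosen from noisy data. The sketch below illustrates the hard-threshold idea of Gavish and Donoho recommended in Step 1. It is an illustration only: the function name `hard_threshold_rank` is hypothetical, and the cutoff uses the commonly quoted approximation for unknown noise level, τ = ω(β) · median(σ) with ω(β) ≈ 0.56β³ − 0.95β² + 1.82β + 1.43, where β is the aspect ratio of the data matrix.

```python
import numpy as np

def hard_threshold_rank(X):
    """Estimate the rank of X by hard-thresholding its singular values.

    Assumes additive white noise of unknown magnitude and uses the
    approximate cutoff tau = omega(beta) * median(singular values).
    """
    s = np.linalg.svd(X, compute_uv=False)
    beta = min(X.shape) / max(X.shape)                # aspect ratio in (0, 1]
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    tau = omega * np.median(s)
    return int(np.sum(s > tau)), s, tau

# Hypothetical test: rank-3 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
n, m, r_true = 200, 50, 3
U0, _ = np.linalg.qr(rng.standard_normal((n, r_true)))
V0, _ = np.linalg.qr(rng.standard_normal((m, r_true)))
X_clean = U0 @ np.diag([100.0, 60.0, 30.0]) @ V0.T
X_noisy = X_clean + 0.1 * rng.standard_normal((n, m))

r_est, s, tau = hard_threshold_rank(X_noisy)
```

For this well-separated example the estimate equals the true rank; when signal singular values sit close to the noise floor, the threshold will discard them by design.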

One of the simplest ways to remove the systematic bias of the DMD algo- 
rithm is by computing it both forward and backward in time and averaging the 
equivalent matrices, as proposed by Dawson et al. [195]. Thus the two follow- 
ing approximations are considered: 


$$X' \approx A_1 X \quad \text{and} \quad X \approx A_2 X', \tag{7.35}$$

where $A_2^{-1} \approx A_1$ for noise-free data. Thus the matrix $A_2$ is the
inverse, or backward time-step, mapping the snapshots from $t_{k+1}$ to $t_k$.
The forward- and backward-time matrices are then averaged, removing the
systematic bias from the measurement noise:

$$A = \tfrac{1}{2}\left(A_1 + A_2^{-1}\right), \tag{7.36}$$

where the standard least-squares optimization in (7.16) can be used to compute
both the forward and backward mappings $A_1$ and $A_2$. This optimization can be
formulated as

$$A = \underset{A}{\operatorname{argmin}} \; \tfrac{1}{2}\left( \|X' - AX\|_F + \|X - A^{-1}X'\|_F \right), \tag{7.37}$$

which is highly nonlinear and non-convex due to the inverse $A^{-1}$. An improved
optimization framework was developed by Azencot et al. [82], which proposes

$$A = \underset{A_1, A_2}{\operatorname{argmin}} \; \tfrac{1}{2}\left( \|X' - A_1 X\|_F + \|X - A_2 X'\|_F \right) \quad \text{s.t.} \quad A_1 A_2 = I, \; A_2 A_1 = I \tag{7.38}$$

to circumvent some of the difficulties of the optimization in (7.37).
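The averaging in (7.36) can be sketched directly for a small state dimension. This is an illustrative implementation under stated assumptions: the function name `forward_backward_dmd` is hypothetical, both regressions use the pseudo-inverse, and the backward matrix is assumed invertible, so for high-dimensional data the same steps would be applied to POD-projected snapshots rather than to the full state.

```python
import numpy as np

def forward_backward_dmd(X, Xp):
    """Average the forward map and the inverse of the backward map, as in (7.36)."""
    A1 = Xp @ np.linalg.pinv(X)       # forward regression:  X' ~ A1 X
    A2 = X @ np.linalg.pinv(Xp)       # backward regression: X  ~ A2 X'
    A_fb = 0.5 * (A1 + np.linalg.inv(A2))
    return A_fb, A1, A2

# Hypothetical test system: a slowly decaying rotation in the plane.
theta = 0.3
A_true = 0.98 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
m = 50
snaps = [np.array([1.0, 0.0])]
for _ in range(m):
    snaps.append(A_true @ snaps[-1])
S = np.column_stack(snaps)

# Noise-free data: forward and backward maps are exact inverses.
A_fb, A1, A2 = forward_backward_dmd(S[:, :-1], S[:, 1:])

# Noisy data: compare eigenvalue magnitudes of the forward-only and averaged estimates.
rng = np.random.default_rng(6)
S_noisy = S + 0.05 * rng.standard_normal(S.shape)
A_fb_noisy, A1_noisy, _ = forward_backward_dmd(S_noisy[:, :-1], S_noisy[:, 1:])
rho_forward = np.abs(np.linalg.eigvals(A1_noisy))
rho_fb = np.abs(np.linalg.eigvals(A_fb_noisy))
```

Comparing `rho_forward` and `rho_fb` against the true magnitude 0.98 over many noise realizations is a simple way to explore the bias discussed above.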

Hemati et al. formulate another DMD algorithm, replacing the origi- 
nal least-squares regression with a total least-squares regression to account for 
the possibility of noisy measurements and disturbances to the state. This work 
also provides an excellent discussion on the sources of noise and a comparison 
of various de-noising algorithms. The subspace DMD algorithm of Takeishi 
et al. compensates for measurement noise by computing an orthogonal 
projection of future snapshots onto the space of previous snapshots and then 
constructing a linear model. Extensions that combine DMD with Bayesian ap- 
proaches have also been developed [691]. 

Good approximations for the mode amplitudes b have also proven
to be difficult to achieve, with and without noise. Jovanović et al.
developed the first algorithm to improve the estimate of the modal amplitudes
by promoting sparsity. In this case, the underlying optimization algorithm is
framed around improving the approximation using the formulation

$$\underset{b}{\operatorname{argmin}} \left( \|X - \Phi \, \mathrm{diag}(b) \, T(\omega)\|_F + \|b\|_1 \right), \tag{7.39}$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm penalization which promotes sparsity of b.
More recently, Askham and Kutz introduced the optimized DMD algorithm, which
uses a variable projection method for nonlinear least-squares to compute the
DMD for unevenly timed samples, significantly mitigating the bias due to noise.
The optimized DMD algorithm solves the exponential fitting problem directly:

$$\underset{\omega, \Phi_b}{\operatorname{argmin}} \; \|X - \Phi_b T(\omega)\|_F, \tag{7.40}$$

where $\Phi_b = \Phi \, \mathrm{diag}(b)$ absorbs the mode amplitudes into the modes.


This has been shown to suppress bias, although one must solve a nonlinear 
optimization problem. However, using statistical bagging methods, optimized 
DMD can be stabilized and the boosted optimized DMD (BOP-DMD) method can 
not only improve performance of the decomposition, but also provide uncer- 
tainty estimates for the DMD eigenvalues and DMD eigenmodes [622]. 

DMD is able to accurately identify an approximate linear model for dynam- 
ics that are linear, periodic, or quasi-periodic. However, DMD is unable to cap- 
ture a linear dynamical system model with essential nonlinear features, such as 
multiple fixed points, unstable periodic orbits, or chaos [127]. As an example, 
DMD will fail to yield a reasonable linear model for the chaotic Lorenz sys- 
tem, and it also will not capture important features of the linear portion of the 
Lorenz model. The sparse identification of nonlinear dynamics (SINDy) [132],
discussed in Section 7.3, is a related algorithm that identifies fully nonlinear
dynamical systems models from data. However, SINDy often faces scaling issues
for high-dimensional systems that do not admit a low-dimensional subspace or 
submanifold. In this case, the recent linear and nonlinear disambiguation opti- 
mization (LANDO) algorithm leverages kernel methods to identify an im- 
plicit model for the full nonlinear dynamics, where it is then possible to extract 
a low-rank DMD approximation for the linear portion linearized about some 
specified operating condition. In this way, the LANDO algorithm robustly
extracts the linear DMD dynamics even from strongly nonlinear systems. This
work is part of a much larger effort to use kernels for learning dynamical
systems and Koopman representations [758].


Example and Code 


Code 7.3 provides a basic DMD implementation. This DMD code is demonstrated
in Fig. 7.3 for the fluid flow past a circular cylinder at Reynolds number
100, based on the cylinder diameter. The DMD eigenvalues are shown in
Fig. 7.4 for clean data and for data corrupted with Gaussian white noise. The
two-dimensional Navier-Stokes equations are simulated using the immersed
boundary projection method (IBPM) solver¹ based on the fast multi-domain
method of Taira and Colonius [688]. The data required for this example
may be downloaded without running the IBPM code at http://DMDbook.

Figure 7.3: Overview of DMD illustrated on the fluid flow past a circular
cylinder at Reynolds number 100. Reproduced from [422].

Figure 7.4: DMD eigenvalues for the fluid flow past a circular cylinder at
Reynolds number 100. When trained on clean data without noise, the
discrete-time DMD eigenvalues are on the unit circle, and the continuous-time
DMD eigenvalues are on the imaginary axis, indicating purely oscillatory
dynamics. However, when noise is added to the training data, the eigenvalues
exhibit spurious damping, as predicted by Bagheri [37].

¹ The IBPM code is publicly available at https://github.com/cwrowley/ibpm
Code 7.3: [MATLAB] DMD implementation.


function [Phi, Lambda, b] = DMD(X,Xprime,r)
[U,Sigma,V] = svd(X,'econ');        % Step 1: economy SVD of X
Ur = U(:,1:r);                      % leading r POD modes
Sigmar = Sigma(1:r,1:r);
Vr = V(:,1:r);

Atilde = Ur'*Xprime*Vr/Sigmar;      % Step 2: reduced matrix
[W,Lambda] = eig(Atilde);           % Step 3: eigendecomposition
Phi = Xprime*(Vr/Sigmar)*W;         % Step 4: exact DMD modes

alpha1 = Sigmar*Vr(1,:)';           % POD coefficients of the first snapshot
b = (W*Lambda)\alpha1;              % mode amplitudes, (7.32f)


Code 7.3: [Python] DMD implementation.
import numpy as np

def DMD(X,Xprime,r):
    U,Sigma,VT = np.linalg.svd(X,full_matrices=0) # Step 1
    Ur = U[:,:r]
    Sigmar = np.diag(Sigma[:r])
    VTr = VT[:r,:]
    # Step 2: Atilde = Ur^* Xprime Vr inv(Sigmar), via a linear solve
    Atilde = np.linalg.solve(Sigmar.T,(Ur.conj().T @ Xprime @ VTr.conj().T).T).T
    Lambda, W = np.linalg.eig(Atilde) # Step 3
    Lambda = np.diag(Lambda)
    # Step 4: Phi = Xprime Vr inv(Sigmar) W
    Phi = Xprime @ np.linalg.solve(Sigmar.T,VTr.conj()).T @ W
    alpha1 = Sigmar @ VTr[:,0] # POD coefficients of the first snapshot
    b = np.linalg.solve(W @ Lambda,alpha1)
    return Phi, Lambda, b


With this data, it is simple to compute the dynamic mode decomposition. 
In MATLAB, the following code is used: 


% VORTALL contains flow fields reshaped into column vectors 
X = VORTALL; 
[Phi, Lambda, b] = DMD(X(:,1:end-1),X(:,2:end),21);


In Python, the following code is used: 


import os
from scipy import io

vortall_mat = io.loadmat(os.path.join('..','DATA','VORTALL.mat'))
X = vortall_mat['VORTALL']
Phi, Lambda, b = DMD(X[:,:-1],X[:,1:],21)
Extensions, Applications, and Limitations


One of the major advantages of dynamic mode decomposition is its simple 
framing in terms of linear regression. DMD does not require knowledge of gov- 
erning equations. For this reason, DMD has been rapidly extended to include 
several methodological innovations and has been widely applied beyond fluid 
dynamics [422], where it originated. Here, we present a number of the lead- 
ing algorithmic extensions and promising domain applications, and we also 
present current limitations of the DMD theory that must be addressed in future 
research. 


Methodological Extensions 


Compression and Randomized Linear Algebra. DMD was originally designed 
for high-dimensional data sets in fluid dynamics, such as a fluid velocity or 
vorticity field, which may contain millions of degrees of freedom. However, 
the fact that DMD often uncovers low-dimensional structure in these high- 
dimensional data implies that there may be more efficient measurement and 
computational strategies based on principles of sparsity (see Chapter 3). There
have been several independent and highly successful extensions and modifica- 
tions of DMD to exploit low-rank structure and sparsity. 

In 2014, Jovanović et al. used sparsity-promoting optimization to iden- 
tify the fewest DMD modes required to describe a data set, essentially identify- 
ing a few dominant DMD mode amplitudes in b. The alternative approach, of 
testing and comparing all subsets of DMD modes, represents a computation- 
ally intractable brute-force search. 

Another line of work is based on the fact that DMD modes generally ad- 
mit a sparse representation in Fourier or wavelet bases. Moreover, the time 
dynamics of each mode are simple pure tone harmonics, which are the defini- 
tion of sparse in a Fourier basis. This sparsity has facilitated several efficient 
measurement strategies that reduce the number of measurements required in 
time and space [303], based on compressed sensing. This has 
the broad potential to enable high-resolution characterization of systems from 
under-resolved measurements. 

Related to the use of compressed sensing, randomized linear algebra has re- 
cently been used to accelerate DMD computations when full-state data is avail- 
able. Instead of collecting subsampled measurements and using compressed 
sensing to infer high-dimensional structures, randomized methods start with 
full data and then randomly project into a lower-dimensional subspace, where 
computations may be performed more efficiently. Bistrian and Navon have 
successfully accelerated DMD using a randomized singular value decomposi- 
tion, and Erichson et al. demonstrate how all of the expensive DMD com- 
putations may be performed in a projected subspace. 
Finally, libraries of DMD modes have also been used to identify dynamical
regimes [411], based on the sparse representation for classification
(see Section 3.6), which was used earlier to identify dynamical regimes using
libraries of POD modes [112, 136].


Inputs and Control. A major strength of DMD is the ability to describe com- 
plex and high-dimensional dynamical systems in terms of a small number of 
dominant modes, which represent spatio-temporal coherent structures. Reduc- 
ing the dimensionality of the system from n (often millions or billions) to r 
(tens or hundreds) enables faster and lower-latency prediction and estimation. 
Lower-latency predictions generally translate directly into controllers with higher 
performance and robustness. Thus, compact and efficient representations of 
complex systems such as fluid flows have been long sought, resulting in the 
field of reduced-order modeling. However, the original DMD algorithm was 
designed to characterize naturally evolving systems, without accounting for 
the effect of actuation and control. 

Shortly after the original DMD algorithm, Proctor et al. extended the 
algorithm to disambiguate between the natural unforced dynamics and the ef- 
fect of actuation. This essentially amounts to a generalized evolution equation 


$$x_{k+1} \approx A x_k + B u_k, \tag{7.41}$$


which results in another linear regression problem (see Section 10.2).
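The regression behind (7.41) can be sketched by stacking states and inputs and solving a single least-squares problem. This is a minimal illustration that skips the rank truncation used for high-dimensional data: the system matrices `A_true` and `B_true`, the random input sequence, and the variable names `Upsilon` and `Omega` are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(7)
n, q, m = 3, 1, 40

# Hypothetical actuated linear system x_{k+1} = A x_k + B u_k.
A_true = np.array([[0.9, 0.1, 0.0],
                   [0.0, 0.8, 0.2],
                   [0.0, 0.0, 0.7]])
B_true = np.array([[0.0], [0.0], [1.0]])

Upsilon = rng.standard_normal((q, m))          # input snapshots u_1, ..., u_m
Xs = np.zeros((n, m + 1))
Xs[:, 0] = rng.standard_normal(n)
for k in range(m):
    Xs[:, k + 1] = A_true @ Xs[:, k] + B_true @ Upsilon[:, k]
X, Xp = Xs[:, :-1], Xs[:, 1:]

# X' ~ [A B] [X; Upsilon]  =>  [A B] = X' pinv([X; Upsilon])
Omega = np.vstack([X, Upsilon])
G = Xp @ np.linalg.pinv(Omega)
A_est, B_est = G[:, :n], G[:, n:]
```

The input must excite the system sufficiently for the stacked matrix to have full row rank; otherwise A and B cannot be separated from the data.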

The original motivation for DMD with control (DMDc) was the use of DMD 
to characterize epidemiological systems (e.g., malaria spreading across a conti- 
nent), where it is not possible to stop intervention efforts, such as vaccinations 
and bed nets, in order to characterize the unforced dynamics [569]. 

Since the original DMDc algorithm, the compressed sensing DMD and DMDc 
algorithms have been combined, resulting in a new framework for compressive 
system identification [41]. In this framework, it is possible to collect undersam- 
pled measurements of an actuated system and identify an accurate and efficient 
low-order model, related to DMD and the eigensystem realization algorithm 
(ERA; see Section 9.3) [358]. 

DMDc models, based on linear and nonlinear measurements of the system, 
have recently been used with model predictive control (MPC) for enhanced 
control of nonlinear systems by Korda and Mezić [404]. Model predictive
control using DMDc models was subsequently used as a benchmark comparison
for MPC based on fully nonlinear models in the work of Kaiser et al. [366], and 
the DMDc models performed surprisingly well, even for strongly nonlinear 
systems. 


Nonlinear Measurements. Much of the excitement around DMD is due to 
the strong connection to nonlinear dynamics via the Koopman operator [611]. 
Indeed, DMD is able to accurately characterize periodic and quasi-periodic
behavior, even in nonlinear systems, as long as a sufficient amount of data is
collected. However, the basic DMD algorithm uses linear measurements of 
the system, which are generally not rich enough to characterize truly non- 
linear phenomena, such as transients, intermittent phenomena, or broadband 
frequency cross-talk. In Williams et al. [757], DMD measurements were aug- 
mented to include nonlinear measurements of the system, enriching the basis 
used to represent the Koopman operator. The so-called extended DMD (eDMD) 
algorithm then seeks to obtain a linear model $A_Y$ advancing nonlinear
measurements $y = g(x)$:


$$y_{k+1} \approx A_Y y_k. \tag{7.42}$$


For high-dimensional systems, this augmented state y may be intractably large, 
motivating the use of kernel methods to approximate the evolution operator 
$A_Y$ [758]. This kernel DMD has since been extended to include dictionary
learning techniques [440].
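A small example shows why augmenting the measurements can help. The sketch below uses a hypothetical discrete-time map (not taken from the text) for which the observables (x1, x2, x1²) happen to close under the dynamics, so a linear model on the lifted state is exact while a linear model on the state alone is not.

```python
import numpy as np

lam, mu = 0.9, 0.5   # parameters of the hypothetical map

def step(x):
    """Nonlinear map: x1 <- lam*x1, x2 <- mu*x2 + (lam^2 - mu)*x1^2."""
    x1, x2 = x
    return np.array([lam * x1, mu * x2 + (lam**2 - mu) * x1**2])

def lift(x):
    """Nonlinear measurements y = g(x) = (x1, x2, x1^2)."""
    return np.array([x[0], x[1], x[0] ** 2])

rng = np.random.default_rng(8)
X0 = rng.uniform(-1.0, 1.0, size=(2, 50))            # sampled states
X1 = np.column_stack([step(x) for x in X0.T])        # their images under the map

# Standard DMD regression on the state itself.
A_x = X1 @ np.linalg.pinv(X0)
res_dmd = np.linalg.norm(X1 - A_x @ X0)

# Extended DMD regression on the lifted measurements, as in (7.42).
Y0 = np.column_stack([lift(x) for x in X0.T])
Y1 = np.column_stack([lift(x) for x in X1.T])
A_Y = Y1 @ np.linalg.pinv(Y0)
res_edmd = np.linalg.norm(Y1 - A_Y @ Y0)

A_Y_expected = np.array([[lam, 0.0, 0.0],
                         [0.0, mu, lam**2 - mu],
                         [0.0, 0.0, lam**2]])
```

Such closure is the exception rather than the rule; for most systems no finite set of hand-picked observables is invariant, which is the limitation discussed in the paragraphs that follow.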

It has recently been shown that eDMD is equivalent to the variational
approach of conformation dynamics (VAC) [535], first derived by Noé
and Nüske in 2013 to simulate molecular dynamics with a broad separation of
timescales. Further connections between eDMD and VAC and between DMD 
and the time-lagged independent component analysis (TICA) are explored in a 
recent review [392]. A key contribution of VAC is a variational score enabling 
the objective assessment of Koopman models via cross-validation. 

Following the extended DMD, it was shown that there are relatively restric- 
tive conditions for obtaining a linear regression model that includes the origi- 
nal state of the system [127]. For nonlinear systems with multiple fixed points, 
periodic orbits, and other attracting structures, there is no finite-dimensional 
linear system including the state x that is topologically conjugate to the nonlin- 
ear system. Instead, it is important to identify Koopman-invariant subspaces, 
spanned by eigenfunctions of the Koopman operator; in general, it will not be 
possible to directly write the state x in the span of these eigenvectors, although 
it may be possible to identify x through a unique inverse. A practical algorithm 
for identifying eigenfunctions is provided by Kaiser et al. [365]. 


Multi-Resolution. DMD is often applied to complex, high-dimensional dy- 
namical systems, such as fluid turbulence or epidemiological systems, that ex- 
hibit multi-scale dynamics in both space and time. Many multi-scale systems 
exhibit transient or intermittent phenomena, such as the El Niño observed in 
global climate data. These transient dynamics are not captured accurately by 
DMD, which seeks spatio-temporal modes that are globally coherent across 
the entire time series of data. To address this challenge, the multi-resolution 
DMD (mrDMD) algorithm was introduced [423], which effectively decomposes 
the dynamics into different timescales, isolating transient and intermittent
patterns. Multi-resolution DMD modes were recently shown to be advantageous
for sparse sensor placement by Manohar et al. [483]. 


Delay Measurements. Although DMD was developed for high-dimensional 
data where it is assumed that one has access to the full state of a system, it 
is often desirable to characterize spatio-temporal coherent structures for sys- 
tems with incomplete measurements. As an extreme example, consider a single 
measurement that oscillates as a sinusoid, $x(t) = \sin(\omega t)$. Although this
would appear to be a perfect candidate for DMD, the algorithm incorrectly
identifies a real eigenvalue because the data does not have sufficient rank to
extract a complex conjugate pair of eigenvalues $\pm i\omega$. This paradox was
first explored by Tu et al. [727], where it was discovered that a solution is
to stack delayed measurements into a larger matrix to augment the rank of the
data matrix and extract phase information. Delay coordinates have been used
effectively to extract coherent patterns in neural recordings [123]. The
connections between delay DMD and Koopman will be discussed more in
Section 7.5.
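The sinusoid example can be reproduced in a few lines. The sketch below is illustrative, with the frequency, time step, and a single delay chosen arbitrarily: DMD on the scalar series returns one real eigenvalue, while DMD on the two-row delay matrix recovers the conjugate pair.

```python
import numpy as np

omega_true, dt, m = 2.0, 0.1, 100
t = dt * np.arange(m)
x = np.sin(omega_true * t)

# DMD on the scalar measurement: a 1 x 1 regression, hence one real eigenvalue.
X, Xp = x[None, :-1], x[None, 1:]
a_scalar = (Xp @ np.linalg.pinv(X)).item()

# Stack one delay: each column is [x_k, x_{k+1}], giving a rank-2 data matrix.
H = np.vstack([x[:-1], x[1:]])
Hx, Hp = H[:, :-1], H[:, 1:]
A_delay = Hp @ np.linalg.pinv(Hx)
lam = np.linalg.eigvals(A_delay)
omega_est = np.log(lam) / dt        # continuous-time eigenvalues, expected +/- i*omega
```

One delay suffices here because a pure sinusoid satisfies a two-term linear recurrence; signals with more frequencies need correspondingly more delays.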


Streaming and Parallelized Codes. Because of the computational burden of 
computing the DMD on high-resolution data, several advances have been made 
to accelerate DMD in streaming applications and with parallelized algorithms. 
DMD is often used in a streaming setting, where a moving window of snap- 
shots are processed continuously, resulting in redundant computations when 
new data becomes available. Several algorithms exist for streaming DMD, based 
on the incremental SVD [322], a streaming method of snapshots SVD [559], and 
rank-one updates to the DMD matrix [772]. The DMD algorithm is also readily 
parallelized, as it is based on the SVD. Several parallelized codes are available,
based on the QR [623] and SVD [234, 236].


Tensor Formulations. Most data used to compute DMD has additional spa- 
tial structure that is discarded when the data is reshaped into column vec- 
tors. The tensor DMD extension of Klus et al. performs DMD on a ten- 
sorial, rather than vectorized, representation of the data, retaining this addi- 
tional structure. In addition, this approach reduces the memory requirements 
and computational complexity for large-scale systems. Extensions to this ap- 
proach have been introduced based on reproducing kernel Hilbert spaces 
and the extended DMD [533], and additional connections have recently been 
made between the Koopman mode decomposition and tensor factorizations 
[594]. Tensor approaches to related methods, such as the sparse identification 
of nonlinear dynamics [132], have also been developed recently [273]. 
Resolvent Analysis. DMD and Koopman operator theory have also been
connected to the resolvent analysis from fluid mechanics [8241655]. Resolvent
analysis seeks to find the most receptive states of a dynamical system that
will be most amplified by forcing, along with the corresponding most responsive
forcings [354, 355, 495, 712]. Sharma, Mezić, and McKeon established several
important connections between DMD, Koopman theory, and the resolvent op- 
erator, including a generalization of DMD to enforce symmetries and traveling 
wave structures. They also showed that the resolvent modes provide an opti- 
mal basis for the Koopman mode decomposition. Typically, resolvent analysis 
is performed by linearizing the governing equations about a base state, often a 
turbulent mean flow. However, this approach is invasive, requiring a working 
Navier-Stokes solver. Herrmann et al. have recently developed a purely 
data-driven resolvent algorithm, based on DMD, that bypasses knowledge of 
the governing equations. DMD and resolvent analysis are also both closely re- 
lated to the spectral POD [708], which is related to the classical POD 
of Lumley and provides time-harmonic modes at a set of discrete frequencies. 


Applications 


Fluid Dynamics. DMD originated in the fluid dynamics community [635], 
and has since been applied to a wide range of flow geometries (jets, cavity 
flow, wakes, channel flow, boundary layers, etc.) to study mixing, acoustics, 
and combustion, among other phenomena. In the original papers of Schmid 
[636], both a cavity flow and a jet were considered. In the original paper 
of Rowley et al. [611], a jet in cross-flow was investigated. It is no surprise that
DMD has subsequently been used widely in both cavity flows
and jets [68, 637, 638, 649].

DMD has also been applied to wake flows, including to investigate fre- 
quency lock-on [725], the wake past a gurney flap [543], the cylinder wake 
[36], and dynamic stall [223]. Boundary layers have also been extensively stud- 
ied with DMD [624]. In acoustics, DMD has been used to capture the 
near-field and far-field acoustics that result from instabilities observed in shear 
flows [668]. In combustion, DMD has been used to understand the coherent 
heat release in turbulent swirl flames and to analyze a rocket combus- 
tor [341]. DMD has also been used to analyze non-normal growth mechanisms 
in thermoacoustic interactions in a Rijke tube. DMD has been compared with 
POD for reacting flows [612]. DMD has also been used to analyze more ex- 
otic flows, including a simulated model of a high-speed train [514]. Shock- 
turbulent boundary layer interaction (STBLI) has also been investigated, and 
DMD was used to identify a pulsating separation bubble that is accompanied 
by shockwave motion [297]. DMD has also been used to study self-excited
fluctuations in detonation waves [490]. Other problems include identifying
hairpin vortices [695], decomposing the flow past a surface-mounted cube [515],
modeling shallow-water equations [92], studying nanofluids past a square
cylinder [620], fluid-structure interaction [292], and measuring the growth
rate of instabilities in annular liquid sheets [220]. A modified recursive DMD
algorithm was also formulated by Noack et al. to provide an orthogonal
basis for empirical Galerkin models in fluids. The use of DMD in fluids fits into
a broader effort to leverage machine learning for improved models and
controllers B99, 441, |617], especially for turbulence closure modeling
[64, 65, 224, 421, 448, 492].


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved.


Epidemiology. DMD has recently been applied to investigate epidemiologi- 
cal systems by Proctor and Eckhoff [568]. This is a particularly interpretable ap- 
plication, as modal frequencies often correspond to yearly or seasonal fluctua- 
tions. Moreover, the phase of DMD modes gives insight into how disease fronts 
propagate spatially, potentially informing future intervention efforts. The ap- 
plication of DMD to disease systems also motivated the DMD with control 
[570], since it is infeasible to stop vaccinations in order to identify the unforced 
dynamics. 


Neuroscience. Complex signals from neural recordings are increasingly high- 
fidelity and high-dimensional, with advances in hardware pushing the fron- 
tiers of data collection. DMD has the potential to transform the analysis of 
such neural recordings, as evidenced in a recent study that identified dynami- 
cally relevant features in electrocorticography (ECOG) data of sleeping patients 
[123]. Since then, several works have applied DMD to neural recordings or
suggested possible implementation in hardware [5, 117, 704].


Video Processing. Separating foreground and background objects in video is 
a common task in surveillance applications. Real-time separation is a challenge 
that is only exacerbated by ever-increasing video resolutions. DMD provides a 
flexible platform for video separation, as the background may be approximated
by a DMD mode with zero eigenvalue [232] 5959].


Other Applications. DMD has been applied to an increasingly diverse array 
of problems, including robotics [78], finance [480], and plasma physics [697]. It 
is expected that this trend will increase. 


Challenges 


Traveling Waves. DMD is based on the SVD of a data matrix $X = U \Sigma V^{*}$
whose columns are spatial measurements evolving in time. In this case, the 
SVD is a space-time separation of variables into spatial modes, given by the 


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 


7.3. SPARSE IDENTIFICATION OF NONLINEAR DYNAMICS (SINDY) 331 


columns of U, and time dynamics, given by the columns of V. As in POD, 
DMD thus has limitations for problems that exhibit traveling waves, where 
separation of variables is known to fail. 
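The difficulty with traveling waves shows up directly in the singular values of the data matrix. The sketch below compares a standing wave with a translating Gaussian pulse; the grid, pulse width, travel distance, and the 1% cutoff in the hypothetical helper `effective_rank` are arbitrary illustrative choices.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 200)     # spatial grid
t = np.linspace(0.0, 1.0, 50)       # snapshot times

# Standing wave: a fixed spatial profile times a temporal oscillation (separable).
standing = np.outer(np.exp(-x**2), np.cos(2 * np.pi * t))

# Traveling wave: the same kind of profile translating from x = -3 to x = 3.
center = -3.0 + 6.0 * t
traveling = np.exp(-((x[:, None] - center[None, :]) ** 2) / (2 * 0.3**2))

def effective_rank(M, tol=1e-2):
    """Number of singular values above tol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s / s[0] > tol))

r_standing = effective_rank(standing)
r_traveling = effective_rank(traveling)
```

The standing wave is captured by a single space-time mode, whereas the translating pulse needs many modes to reach the same accuracy, even though both are described by one simple profile.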


Transients. Many systems of interest are characterized by transients and in- 
termittent phenomena. Several methods have been proposed to identify these 
events, such as the multi-resolution DMD and the use of delay coordinates. 
However, it is still necessary to formalize the choice of relevant timescales and 
the window size to compute DMD. 


Continuous Spectrum. Related to the above, many systems are characterized 
by broadband frequency content, as opposed to a few distinct and discrete fre- 
quencies. This broadband frequency content is also known as a continuous spec- 
trum, where every frequency in a continuous range is observed. For example, 
the simple pendulum exhibits a continuous spectrum, as the system has a nat- 
ural frequency for small deflections, and this frequency continuously deforms 
and slows as energy is added to the pendulum. Other systems include nonlin- 
ear optics and broadband turbulence. These systems pose a serious challenge 
for DMD, as they result in a large number of modes, even though the dynamics 
are likely generated by the nonlinear interactions of a few dominant modes. 

Several data-driven approaches have been recently proposed to handle systems
with continuous spectra. Applying DMD to a vector of delayed measurements
of a system, the so-called HAVOK analysis in Section 7.5, has been shown
to approximate the dynamics of chaotic systems, such as the Lorenz system, 
which exhibits a continuous spectrum. In addition, Lusch et al. showed 
that it is possible to design a deep learning architecture with an auxiliary net- 
work to parameterize the continuous frequency. 


Strong Nonlinearity and Choice of Measurements. Although significant progress 
has been made connecting DMD to nonlinear systems [758], choosing nonlin- 
ear measurements to augment the DMD regression is still not an exact science. 
Identifying measurement subspaces that remain closed under the Koopman 
operator is an ongoing challenge [127]. Recent progress in deep learning has 
the potential to enable the representation of extremely complex eigenfunctions
from data [465, 766].


7.3 Sparse Identification of Nonlinear Dynamics (SINDy) 


Discovering dynamical systems models from data is a central challenge in math- 
ematical physics, with a rich history going back at least as far as the time of 
Kepler and Newton and the discovery of the laws of planetary motion. Historically,
this process relied on a combination of high-quality measurements
and expert intuition. With vast quantities of data and increasing computational 
power, the automated discovery of governing equations and dynamical systems 
is a new and exciting scientific paradigm. 

Typically, either the form of a candidate model is constrained via prior 
knowledge of the governing equations, as in Galerkin projection
(see Chapter 13), or a handful of heuristic models are
tested and parameters are optimized to fit data. Alternatively, best-fit linear 
models may be obtained using DMD or ERA. Simultaneously identifying the 
nonlinear structure and parameters of a model from data is considerably more 
challenging, as there are combinatorially many possible model structures. 

The sparse identification of nonlinear dynamics (SINDy) algorithm
bypasses the intractable combinatorial search through all possible model
structures, leveraging the fact that many dynamical systems

$$\frac{d}{dt}\mathbf{x} = \mathbf{f}(\mathbf{x}) \tag{7.43}$$

have dynamics $\mathbf{f}$ with only a few active terms in the space of possible
right-hand side functions; for example, the Lorenz equations have only a few
linear and quadratic interaction terms per equation.

We then seek to approximate $\mathbf{f}$ by a generalized linear model

$$\mathbf{f}(\mathbf{x}) \approx \sum_{k=1}^{p} \theta_k(\mathbf{x})\,\xi_k = \boldsymbol{\Theta}(\mathbf{x})\boldsymbol{\xi}, \tag{7.44}$$

with as few non-zero terms in $\boldsymbol{\xi}$ as possible. It is then possible to solve
for the relevant terms that are active in the dynamics using sparse regression
that penalizes the number of terms in the dynamics and
scales well to large problems.

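First, time-series data is collected from (7.43) and formed into a data matrix:

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}(t_1) & \mathbf{x}(t_2) & \cdots & \mathbf{x}(t_m) \end{bmatrix}^T. \tag{7.45}$$

A similar matrix of derivatives is formed:

$$\dot{\mathbf{X}} = \begin{bmatrix} \dot{\mathbf{x}}(t_1) & \dot{\mathbf{x}}(t_2) & \cdots & \dot{\mathbf{x}}(t_m) \end{bmatrix}^T. \tag{7.46}$$
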
In practice, this may be computed directly from the data in $\mathbf{X}$; for noisy data,
the total-variation regularized derivative tends to provide numerically robust
derivatives [169]. Alternatively, it is possible to formulate the SINDy algorithm
for discrete-time systems $\mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k)$, as in the DMD algorithm, and avoid
derivatives entirely.
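
The two options above can be sketched in a few lines of Python. This is a minimal illustration that assumes clean, uniformly sampled data; the function names and the test signal are hypothetical and not part of the book's code. For noisy data, a regularized derivative such as the total-variation method cited above would take the place of `np.gradient`.

```python
import numpy as np

def finite_difference_derivative(X, dt):
    """Approximate dX/dt for snapshots stored as rows of X (uniform time step dt)."""
    # Central differences in the interior, second-order one-sided stencils at the ends
    return np.gradient(X, dt, axis=0, edge_order=2)

def discrete_time_pairs(X):
    """Return (X_k, X_{k+1}) for a discrete-time model x_{k+1} = F(x_k); no derivatives needed."""
    return X[:-1], X[1:]

# Illustrative data: x(t) = [sin t, cos t], whose exact derivative is [cos t, -sin t]
dt = 0.001
t = np.arange(0.0, 2.0, dt)
X = np.column_stack([np.sin(t), np.cos(t)])

dX = finite_difference_derivative(X, dt)
dX_exact = np.column_stack([np.cos(t), -np.sin(t)])
max_error = np.max(np.abs(dX - dX_exact))

Xk, Xk1 = discrete_time_pairs(X)
```

With noise-free data the finite-difference error shrinks quadratically with the time step, which is why clean simulations can skip the regularization step.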
A library of candidate nonlinear functions $\boldsymbol{\Theta}(\mathbf{X})$ may be constructed from
the data in $\mathbf{X}$:

$$\boldsymbol{\Theta}(\mathbf{X}) = \begin{bmatrix} \mathbf{1} & \mathbf{X} & \mathbf{X}^2 & \cdots & \mathbf{X}^d & \cdots & \sin(\mathbf{X}) & \cdots \end{bmatrix}. \tag{7.47}$$

Here, the matrix $\mathbf{X}^d$ denotes a matrix with column vectors given by all possible
time series of dth-degree polynomials in the state x. In general, this library of
candidate functions is only limited by one's imagination.

The dynamical system in (7.43) may now be represented in terms of the data
matrices in (7.46) and (7.47) as

$$\dot{\mathbf{X}} = \boldsymbol{\Theta}(\mathbf{X})\boldsymbol{\Xi}. \tag{7.48}$$


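Each column $\boldsymbol{\xi}_k$ in $\boldsymbol{\Xi}$ is a vector of coefficients determining the active terms in
the kth row in (7.43). A parsimonious model will provide an accurate model
fit in (7.48) with as few terms as possible in $\boldsymbol{\Xi}$. Such a model may be identified
using a convex $\ell_1$-regularized sparse regression:

$$\boldsymbol{\xi}_k = \operatorname{argmin}_{\boldsymbol{\xi}'_k} \|\dot{\mathbf{X}}_k - \boldsymbol{\Theta}(\mathbf{X})\boldsymbol{\xi}'_k\|_2 + \lambda\|\boldsymbol{\xi}'_k\|_1. \tag{7.49}$$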


Here, $\dot{\mathbf{X}}_k$ is the kth column of $\dot{\mathbf{X}}$, and $\lambda$ is a sparsity-promoting knob. Sparse
regression, such as the LASSO or the sequential thresholded least-squares
(STLS) algorithm used in SINDy [132], improves the numerical robustness of
this identification for noisy over-determined problems, in contrast to earlier
methods that used compressed sensing [717].
We advocate the STLS (Code 7.4) to select active terms.


Code 7.4: [MATLAB] Sequentially thresholded least-squares.
function Xi = sparsifyDynamics(Theta,dXdt,lambda,n)
% Compute sparse regression: sequential least squares
Xi = Theta\dXdt;                   % Initial guess: least-squares
% lambda is our sparsification knob.
for k=1:10
    smallinds = (abs(Xi)<lambda);  % Find small coefficients
    Xi(smallinds)=0;               % and threshold
    for ind = 1:n                  % n is state dimension
        biginds = ~smallinds(:,ind);
        % Regress dynamics onto remaining terms to find sparse Xi
        Xi(biginds,ind) = Theta(:,biginds)\dXdt(:,ind);
    end
end
Figure 7.5: Schematic of the sparse identification of nonlinear dynamics
(SINDy) algorithm [132]. Parsimonious models are selected from a library of
candidate nonlinear terms using sparse regression. This library $\boldsymbol{\Theta}(\mathbf{X})$ may be
constructed purely from measurement data. Modified from Brunton et al. [132].


Code 7.4: [Python] Sequentially thresholded least-squares.
import numpy as np

def sparsifyDynamics(Theta,dXdt,lamb,n):
    # Initial guess: Least-squares
    Xi = np.linalg.lstsq(Theta,dXdt,rcond=None)[0]

    for k in range(10):
        smallinds = np.abs(Xi) < lamb   # Find small coeffs.
        Xi[smallinds] = 0               # and threshold
        for ind in range(n):            # n is state dimension
            biginds = smallinds[:,ind] == 0
            # Regress onto remaining terms to find sparse Xi
            Xi[biginds,ind] = np.linalg.lstsq(Theta[:,biginds],
                                  dXdt[:,ind],rcond=None)[0]
    return Xi


The sparse vectors $\boldsymbol{\xi}_k$ may be synthesized into a dynamical system:

$$\dot{x}_k = \boldsymbol{\Theta}(\mathbf{x})\boldsymbol{\xi}_k. \tag{7.50}$$

Note that $x_k$ is the kth element of $\mathbf{x}$ and $\boldsymbol{\Theta}(\mathbf{x})$ is a row vector of symbolic
functions of $\mathbf{x}$, as opposed to the data matrix $\boldsymbol{\Theta}(\mathbf{X})$. Figure 7.5 shows how SINDy
may be used to discover the Lorenz equations from data. Code 7.5 performs
the SINDy regression for the Lorenz system based on the data generated in
Code 7.2.

Code 7.5: [MATLAB] SINDy regression to identify the Lorenz system from data.
%% Compute derivatives by evaluating lorenz on trajectory x
for i=1:length(x)
    dx(i,:) = lorenz(0,x(i,:),Beta);
end

%% Build library and compute sparse regression
Theta = poolData(x,n,3);   % Up to third order polynomials
lambda = 0.025;            % lambda is our sparsification knob.
Xi = sparsifyDynamics(Theta,dx,lambda,n)


Code 7.5: [Python] SINDy regression to identify the Lorenz system from data.
## Compute Derivative
dx = np.zeros_like(x)
for j in range(len(t)):
    dx[j,:] = lorenz_deriv(x[j,:],0,sigma,beta,rho)

Theta = poolData(x,n,3)   # Up to third order polynomials
lamb = 0.025              # sparsification knob lambda
Xi = sparsifyDynamics(Theta,dx,lamb,n)


This code also relies on a function poolData that generates the library $\boldsymbol{\Theta}$.
In this case, polynomials up to third order are used. This code is available online.
For more in-depth applications, we strongly recommend using the open-source
Python software package PySINDy [199] at https://github.com/dynamicslab/pysindy.
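
For readers without the accompanying files, the whole pipeline can be sketched in a self-contained form. The snippet below is an illustration under stated assumptions, not the book's code: `pool_data` is a hypothetical stand-in for poolData limited to quadratic monomials, `stls` mirrors Code 7.4, and the derivatives are evaluated exactly at randomly sampled states rather than along a simulated trajectory. The Lorenz parameters are taken to be the standard values 10, 28, and 8/3.

```python
import numpy as np
from itertools import combinations_with_replacement

def lorenz_rhs(X, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Evaluate the Lorenz vector field row-wise on an (m, 3) array of states."""
    x, y, z = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def pool_data(X, degree):
    """Hypothetical stand-in for poolData: all monomials of the columns of X up to `degree`."""
    m, n = X.shape
    columns = [np.ones(m)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

def stls(Theta, dXdt, lamb, n_iter=10):
    """Sequentially thresholded least squares, following Code 7.4."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < lamb
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            big = ~small[:, k]
            Xi[big, k] = np.linalg.lstsq(Theta[:, big], dXdt[:, k], rcond=None)[0]
    return Xi

# Sample states in a box that roughly covers the attractor (an assumption of this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(low=[-20.0, -25.0, 0.0], high=[20.0, 25.0, 50.0], size=(500, 3))
dX = lorenz_rhs(X)

# Library columns: 1, x, y, z, xx, xy, xz, yy, yz, zz
Theta = pool_data(X, 2)
Xi = stls(Theta, dX, lamb=0.025)
print(np.round(Xi, 4))
```

Because the derivatives here are exact, the initial least-squares fit is already accurate and the thresholding only removes round-off; with measured data the choice of the threshold matters much more.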


The output of the SINDy algorithm is a sparse matrix of coefficients $\boldsymbol{\Xi}$:

    ''        'xdot'         'ydot'         'zdot'
    '1'       [       0]     [       0]     [       0]
    'x'       [-10.0000]     [ 28.0000]     [       0]
    'y'       [ 10.0000]     [ -1.0000]     [       0]
    'z'       [       0]     [       0]     [ -2.6667]
    'xx'      [       0]     [       0]     [       0]
    'xy'      [       0]     [       0]     [  1.0000]
    'xz'      [       0]     [ -1.0000]     [       0]
    'yy'      [       0]     [       0]     [       0]
    'yz'      [       0]     [       0]     [       0]
    'zz'      [       0]     [       0]     [       0]
    'xxx'     [       0]     [       0]     [       0]
    'xxy'     [       0]     [       0]     [       0]
    'xxz'     [       0]     [       0]     [       0]
    'xyy'     [       0]     [       0]     [       0]
    'xyz'     [       0]     [       0]     [       0]
    'xzz'     [       0]     [       0]     [       0]
Figure 7.6: Schematic overview of nonlinear model identification from
high-dimensional data using the sparse identification of nonlinear dynamics
(SINDy) [132]. This procedure is modular, so that different techniques can be 
used for the feature extraction and regression steps. In this example of flow 
past a cylinder, SINDy discovers the model of Noack et al. [524]. Modified from 
Brunton et al. [132]. 
The result of the SINDy regression is a parsimonious model that includes
only the most important terms required to explain the observed behavior. The 
sparse regression procedure used to identify the most parsimonious nonlinear 
model is a convex procedure. The alternative approach, which involves regres- 
sion onto every possible sparse nonlinear structure, constitutes an intractable 
brute-force search through the combinatorially many candidate model forms. 
SINDy bypasses this combinatorial search with modern convex optimization 
and machine learning. It is interesting to note that, for discrete-time dynamics, 
if $\boldsymbol{\Theta}(\mathbf{X})$ consists only of linear terms, and if we remove the sparsity-promoting
term by setting $\lambda = 0$, then this algorithm reduces to the dynamic mode
decomposition [727]. If a least-squares regression is used, as in DMD,
then even a small amount of measurement error or numerical round-off will 
lead to every term in the library being active in the dynamics, which is non- 
physical. A major benefit of the SINDy architecture is the ability to identify 
parsimonious models that contain only the required nonlinear terms, resulting 
in interpretable models that avoid overfitting. 
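
The reduction to DMD noted above can be checked numerically. The following sketch assumes a small, made-up linear map `A_true` and noise-free snapshots; with a linear library and no thresholding, the discrete-time regression returns the transpose of the best-fit linear operator, which coincides with the untruncated DMD operator.

```python
import numpy as np

# Assumed ground-truth linear map for the illustration
A_true = np.array([[0.9, 0.2],
                   [-0.1, 0.8]])

# Generate a discrete-time trajectory x_{k+1} = A x_k, stored as rows
m = 20
X = np.zeros((m, 2))
X[0] = [1.0, 0.5]
for k in range(m - 1):
    X[k + 1] = A_true @ X[k]

Xk, Xk1 = X[:-1], X[1:]

# Linear library Theta(X) = X and lambda = 0: plain least squares, X_{k+1} = Theta(X_k) Xi
Theta = Xk
Xi = np.linalg.lstsq(Theta, Xk1, rcond=None)[0]

# Rows of X are snapshots, so Xi is the transpose of the best-fit linear operator
A_fit = Xi.T

# The same operator written with snapshots as columns, A = X' pinv(X), without rank truncation
A_dmd = Xk1.T @ np.linalg.pinv(Xk.T)
```

Adding nonlinear columns to `Theta` and a non-zero threshold turns this same regression into the discrete-time SINDy problem.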
Applications, Extensions, and Historical Context


The SINDy algorithm has recently been applied to identify high-dimensional
dynamical systems, such as fluid flows, based on POD coefficients
[456]. Figure 7.6 illustrates the application of SINDy to the flow past a cylinder,
where the generalized mean-field model of Noack et al. was discovered
from data. Since its introduction, SINDy has been applied to a wide
range of systems, including reduced-order models of fluid dynamics
and plasma dynamics [373], turbulence closures [64, 65, 634], nonlinear
optics [670], numerical integration schemes [701],
discrepancy modeling [363], boundary value problems [656], identifying
dynamics on Poincaré maps [104, 105], tensor formulations [273], and systems
with stochastic dynamics [96, 146]. The integral formulation of SINDy has
also proven to be powerful, enabling the identification of governing equations
in a weak form that averages over control volumes; this approach has recently
been used to discover a hierarchy of fluid and plasma models [12, 305, 597, 598].
The open-source software package PySINDy has been developed in Python
to integrate the various extensions of SINDy [199], such as promoting global
boundedness by incorporating the Schlegel and Noack constraint in
the optimization.

Because SINDy is formulated in terms of linear regression in a nonlinear
library, it is highly extensible. The SINDy framework has been recently
generalized by Loiseau and Brunton to incorporate known physical constraints
and symmetries in the equations by implementing a constrained sequentially
thresholded least-squares optimization. In particular, energy-preserving
constraints on the quadratic nonlinearities in the Navier-Stokes equations were
imposed to identify fluid systems [455], where it is known that these constraints
promote stability [472]. This work also showed that polynomial libraries
are particularly useful for building models of fluid flows in terms of
POD coefficients, yielding interpretable models that are related to classical Galerkin
projection [132, 455]. Loiseau et al. also demonstrated the ability of SINDy
to identify dynamical systems models of high-dimensional systems, such as
fluid flows, from a few physical sensor measurements, such as lift and drag
measurements on the cylinder in Fig. 7.6. For actuated systems, SINDy has been
generalized to include inputs and control [133], and these models are highly
effective for model predictive control [366]. It is also possible to extend the SINDy
algorithm to identify dynamics with rational function nonlinearities [478] and
integral terms [627], and to identify dynamics from highly corrupt and incomplete data [709].
SINDy was also recently extended to incorporate information criteria for objective
model selection [479], and to identify models with hidden variables using
delay coordinates [126]. Champion et al. combined SINDy with a deep
autoencoder neural network to simultaneously learn coordinates and dynamics,
which will be discussed in Chapter 14. Finally, the SINDy framework was
generalized to include partial derivatives, enabling the identification of partial
differential equation models [613, 626]. Several of these recent innovations will
be explored in more detail below.

PySINDy repository: https://github.com/dynamicslab/pysindy

More generally, the use of sparsity-promoting methods in dynamics is quite 
recent [743]. Other techniques for 
dynamical system discovery include methods to discover equations from time 
series [186], equation-free modeling [383], empirical dynamic modeling
[765], modeling emergent behavior [603], the nonlinear autoregressive moving
average model with exogenous inputs (NARMAX) [774], and automated inference
of dynamics [641]. Broadly speaking, these techniques may be
classified as system identification, where methods from statistics and machine 
learning are used to identify dynamical systems from data. Nearly all methods 
of system identification involve some form of regression of data onto dynamics, 
and the main distinction between the various techniques is the degree to which 
this regression is constrained. For example, the dynamic mode decomposition 
generates best-fit linear models. Recent nonlinear regression techniques have 
produced nonlinear dynamic models that preserve physical constraints, such as 
conservation of energy. A major breakthrough in automated nonlinear system 
identification was made by Bongard and Lipson and Schmidt and Lipson 
[640], where they used genetic programming to identify the structure of non- 
linear dynamics. These methods are highly flexible and impose very few con- 
straints on the form of the dynamics identified. In addition, SINDy is closely 
related to NARMAX [86], which identifies the structure of models from time- 
series data through an orthogonal least-squares procedure. 


Discovering Partial Differential Equations 


A major extension of the SINDy modeling framework generalized the library 
to include partial derivatives, enabling the identification of partial differen- 
tial equations [626]. The resulting algorithm, called the partial differential 
equation functional identification of nonlinear dynamics (PDE-FIND), has been 
demonstrated to successfully identify several canonical PDEs from classical 
physics, purely from noisy data. These PDEs include Navier-Stokes, Kuramoto-Sivashinsky,
Schrödinger, reaction-diffusion, Burgers, Korteweg-de Vries, and
the diffusion equation for Brownian motion [613]. 

Figure 7.7: Steps in the PDE-FIND algorithm, applied to infer the Navier-Stokes
equations from data. 1a. Data is collected as snapshots of a solution to a PDE.
1b. Numerical derivatives are taken and data is compiled into a large matrix
$\boldsymbol{\Theta}$, incorporating candidate terms for the PDE. 1c. Sparse regression is used
to identify active terms in the PDE. 2a. For large data sets, sparse sampling
may be used to reduce the size of the problem. 2b. Subsampling the data set is
equivalent to taking a subset of rows from the linear system in (7.52). 2c. An
identical sparse regression problem is formed but with fewer rows. d. Active
terms in $\boldsymbol{\xi}$ are synthesized into a PDE. Reproduced from Rudy et al. [613].

PDE-FIND is similar to SINDy, in that it is based on sparse regression in a
library constructed from measurement data. The sparse regression and discovery
method is shown in Fig. 7.7. PDE-FIND is outlined below for PDEs in a single
variable, although the theory is readily generalized to higher-dimensional
PDEs. The spatial time-series data is arranged into a single column vector
$\boldsymbol{\Upsilon} \in \mathbb{C}^{mn}$, representing data collected over m time points and n spatial locations.
Additional inputs, such as a known potential for the Schrödinger equation, or
the magnitude of complex data, are arranged into a column vector $\mathbf{Q} \in \mathbb{C}^{mn}$.
Next, a library $\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q}) \in \mathbb{C}^{mn \times D}$ of D candidate linear and nonlinear terms
and partial derivatives for the PDE is constructed. Derivatives are taken either
using finite differences for clean data, or, when noise is added, with polynomial
interpolation. The candidate linear and nonlinear terms and partial derivatives
are then combined into a matrix $\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})$ which takes the form


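$$\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q}) = \begin{bmatrix} \mathbf{1} & \boldsymbol{\Upsilon} & \boldsymbol{\Upsilon}^2 & \cdots & \mathbf{Q} & \cdots & \boldsymbol{\Upsilon}_x & \boldsymbol{\Upsilon}\boldsymbol{\Upsilon}_x & \cdots \end{bmatrix}. \tag{7.51}$$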


Each column of $\boldsymbol{\Theta}$ contains all of the values of a particular candidate function
across all of the mn space-time grid points on which data is collected. The time
derivative $\boldsymbol{\Upsilon}_t$ is also computed and reshaped into a column vector. Figure 7.7
demonstrates the data collection and processing. As an example, a column of
$\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})$ may be $qu^2$.

The PDE evolution can be expressed in this library as follows: 


$$\boldsymbol{\Upsilon}_t = \boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})\boldsymbol{\xi}. \tag{7.52}$$
Each entry in $\boldsymbol{\xi}$ is a coefficient corresponding to a term in the PDE, and, for
canonical PDEs, the vector $\boldsymbol{\xi}$ is sparse, meaning that only a few terms are active.

If the library $\boldsymbol{\Theta}$ has a sufficiently rich column space that the dynamics are
in its span, then the PDE should be well represented by (7.52) with a sparse
vector of coefficients $\boldsymbol{\xi}$. To identify the few active terms in the dynamics, a
sparsity-promoting regression is employed, as in SINDy. Importantly, the regression
problem in (7.52) may be poorly conditioned. Error in computing the
derivatives will be magnified by numerical errors when inverting $\boldsymbol{\Theta}$. Thus a
least-squares regression radically changes the qualitative nature of the inferred
dynamics.

In general, we seek the sparsest vector $\boldsymbol{\xi}$ that satisfies (7.52) with a small
residual. Instead of an intractable combinatorial search through all possible
sparse vector structures, a common technique is to relax the problem to a convex
$\ell_1$-regularized least-squares [702]; however, this tends to perform poorly
with highly correlated data. Instead, we use ridge regression with hard thresholding,
which we call sequential threshold ridge regression (STRidge in Algorithm 1,
reproduced from Rudy et al. [613]). For a given tolerance and threshold
$\lambda$, this gives a sparse approximation to $\boldsymbol{\xi}$. We iteratively refine the tolerance of
Algorithm 1 to find the best predictor based on the selection criteria,


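$$\hat{\boldsymbol{\xi}} = \operatorname{argmin}_{\boldsymbol{\xi}} \|\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})\boldsymbol{\xi} - \boldsymbol{\Upsilon}_t\|_2^2 + \epsilon\,\kappa(\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q}))\|\boldsymbol{\xi}\|_0, \tag{7.53}$$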


where $\kappa(\boldsymbol{\Theta})$ is the condition number of the matrix $\boldsymbol{\Theta}$, providing stronger
regularization for ill-posed problems. Penalizing $\|\boldsymbol{\xi}\|_0$ discourages overfitting by
selecting from the optimal position in a Pareto front.


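Algorithm 1: STRidge($\boldsymbol{\Theta}$, $\boldsymbol{\Upsilon}_t$, $\lambda$, tol, iters) [613].
    $\hat{\boldsymbol{\xi}} = \operatorname{argmin}_{\boldsymbol{\xi}} \|\boldsymbol{\Theta}\boldsymbol{\xi} - \boldsymbol{\Upsilon}_t\|_2^2 + \lambda\|\boldsymbol{\xi}\|_2^2$        % ridge regression
    bigcoeffs = $\{j : |\hat{\xi}_j| > \text{tol}\}$        % select large coefficients
    $\hat{\boldsymbol{\xi}}$[~bigcoeffs] = 0        % apply hard threshold
    $\hat{\boldsymbol{\xi}}$[bigcoeffs] = STRidge($\boldsymbol{\Theta}$[:, bigcoeffs], $\boldsymbol{\Upsilon}_t$, $\lambda$, tol, iters − 1)
                                        % recursive call with fewer coefficients
    return $\hat{\boldsymbol{\xi}}$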
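
A direct Python transcription of Algorithm 1 might look as follows. This is a sketch rather than the reference implementation of Rudy et al.: the function name `stridge`, the stopping rules for the recursion, and the synthetic test problem are all assumptions made for illustration.

```python
import numpy as np

def stridge(Theta, Ut, lam, tol, iters):
    """Sequential threshold ridge regression, following the recursion in Algorithm 1."""
    D = Theta.shape[1]
    # Ridge regression: minimize ||Theta xi - Ut||_2^2 + lam ||xi||_2^2
    xi = np.linalg.solve(Theta.T @ Theta + lam * np.eye(D), Theta.T @ Ut)
    big = np.abs(xi) > tol          # select large coefficients
    xi[~big] = 0.0                  # apply hard threshold
    # Recurse on the surviving columns; these stopping rules are an assumption of the sketch
    if iters > 1 and big.any() and not big.all():
        xi[big] = stridge(Theta[:, big], Ut, lam, tol, iters - 1)
    return xi

# Synthetic test problem: six candidate terms, two of which are active
rng = np.random.default_rng(1)
Theta = rng.standard_normal((200, 6))
xi_true = np.array([0.0, 1.5, 0.0, 0.0, -0.7, 0.0])
Ut = Theta @ xi_true

xi_hat = stridge(Theta, Ut, lam=1e-5, tol=0.1, iters=10)
```

The ridge term keeps the solve well posed when library columns are strongly correlated, which is the situation the text identifies as problematic for the $\ell_1$ relaxation.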


As in the SINDy algorithm, it is important to provide sufficiently rich training
data to disambiguate between several different models. For example, Fig. 7.8
illustrates the use of the PDE-FIND algorithm identifying the Korteweg-de 
Vries (KdV) equation. If only a single traveling wave is analyzed, the method 
incorrectly identifies the standard linear advection equation, as this is the sim- 
plest equation that describes a single traveling wave. However, if two traveling 
waves of different amplitudes are analyzed, the KdV equation is correctly iden- 
tified, as it describes the different amplitude-dependent wave speeds. 
Figure 7.8: Inferring nonlinearity via observing solutions at multiple amplitudes.
(a) An example two-soliton solution to the KdV equation. (b) Applying
our method to a single-soliton solution determines that it solves the standard 
advection equation. (c) Looking at two completely separate solutions reveals 
nonlinearity. Reproduced from Rudy et al. [613]. 


The PDE-FIND algorithm can also be used to identify PDEs based on Lagrangian
measurements that follow the path of individual particles. For example,
Fig. 7.9 illustrates the identification of the diffusion equation describing
Brownian motion of a particle based on a single long time-series measurement 
of the particle position. In this example, the time series is broken up into several 
short sequences, and the evolution of the distribution of these positions is used 
to identify the diffusion equation. 


Extension of SINDy for Rational Function Nonlinearities 


Many dynamical systems, such as metabolic and regulatory networks in biol- 
ogy, contain rational function nonlinearities in the dynamics. Often, these ra- 
tional function nonlinearities arise because of a separation of timescales. Al- 
though the original SINDy algorithm is highly flexible in terms of the choice of 
the library of nonlinearities, it is not straightforward to identify rational func- 
tions, since general rational functions are not sparse linear combinations of a 
few basis functions. Instead, it is necessary to reformulate the dynamics in an 
implicit ordinary differential equation and modify the optimization procedure 
accordingly, as in Mangan et al. and Kaheman et al. [364]. 

Figure 7.9: Inferring the diffusion equation from a single Brownian motion.
(a) The time series is broken into many short random walks that are used to
construct histograms of the displacement. (b) The Brownian motion trajectory,
following the diffusion equation. (c) Parameter error ($\|\boldsymbol{\xi}^* - \boldsymbol{\xi}\|_1$) versus length
of known time series. Blue symbols correspond to correct identification of the
structure of the diffusion model, $u_t = cu_{xx}$. Reproduced from Rudy et al. [613].

We consider dynamical systems with rational nonlinearities:

$$\dot{x}_k = \frac{f_N(\mathbf{x})}{f_D(\mathbf{x})}, \tag{7.54}$$

where $x_k$ is the kth variable, and $f_N(\mathbf{x})$ and $f_D(\mathbf{x})$ represent numerator and
denominator polynomials in the state variable $\mathbf{x}$. For each index k, it is possible
to multiply both sides by the denominator $f_D$, resulting in the equation:

$$f_N(\mathbf{x}) - f_D(\mathbf{x})\dot{x}_k = 0. \tag{7.55}$$
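The implicit form of (7.55) motivates a generalization of the function library
$\boldsymbol{\Theta}$ in (7.47) in terms of the state $\mathbf{x}$ and the derivative $\dot{x}_k$:

$$\boldsymbol{\Theta}(\mathbf{X}, \dot{x}_k(t)) = \begin{bmatrix} \boldsymbol{\Theta}_N(\mathbf{X}) & \operatorname{diag}(\dot{x}_k(t))\boldsymbol{\Theta}_D(\mathbf{X}) \end{bmatrix}. \tag{7.56}$$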


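The first term, $\boldsymbol{\Theta}_N(\mathbf{X})$, is the library of numerator monomials in $\mathbf{x}$, as in (7.47).
The second term, $\operatorname{diag}(\dot{x}_k(t))\boldsymbol{\Theta}_D(\mathbf{X})$, is obtained by multiplying each column
of the library of denominator polynomials $\boldsymbol{\Theta}_D(\mathbf{X})$ with the vector $\dot{x}_k(t)$ in an
element-wise fashion. For a single variable $x_k$, this would give the following:

$$\operatorname{diag}(\dot{x}_k(t))\boldsymbol{\Theta}(\mathbf{X}) = \begin{bmatrix} \dot{x}_k(t) & (\dot{x}_k x_k)(t) & (\dot{x}_k x_k^2)(t) & \cdots \end{bmatrix}. \tag{7.57}$$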


In most cases, we will use the same polynomial degree for both the numerator
and denominator library, so that $\boldsymbol{\Theta}_N(\mathbf{X}) = \boldsymbol{\Theta}_D(\mathbf{X})$. Thus, the augmented
library in (7.56) is only twice the size of the original library in (7.47).
We may now write the dynamics in (7.55) in terms of the augmented library
in (7.56):

$$\boldsymbol{\Theta}(\mathbf{X}, \dot{x}_k(t))\boldsymbol{\xi}_k = \mathbf{0}. \tag{7.58}$$


The sparse vector of coefficients $\boldsymbol{\xi}_k$ will have non-zero entries for the active
terms in the dynamics. However, it is not possible to use the same sparse regression
procedure as in SINDy, since the sparsest vector $\boldsymbol{\xi}_k$ that satisfies (7.58)
is the trivial zero vector.

Instead, the sparsest non-zero vector $\boldsymbol{\xi}_k$ that satisfies (7.58) is identified as
the sparsest vector in the null space of $\boldsymbol{\Theta}$. This is generally a non-convex problem,
although there are recent algorithms developed by Qu et al. [576], based
on the alternating directions method (ADM), to identify the sparsest vector in
a subspace. Unlike the original SINDy algorithm, this procedure is quite sensitive
to noise, as the null space is numerically approximated as the span of the
singular vectors corresponding to small singular values. When noise is added to
the data matrix $\mathbf{X}$, and hence to $\boldsymbol{\Theta}$, the noise floor of the singular value
decomposition goes up, increasing the rank of the numerical null space.
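
The null-space formulation can be illustrated on a toy problem. The sketch below assumes a made-up scalar system, $\dot{x} = -x/(1+x)$, with exact derivatives and degree-one numerator and denominator libraries, so that the null space of the augmented library is one-dimensional and the sparsest null vector is simply the last right singular vector. In realistic problems the null space has several dimensions and a dedicated search such as ADM is needed; that step is not shown here.

```python
import numpy as np

# Illustrative rational system (an assumption for this sketch): dx/dt = -x / (1 + x)
x = np.linspace(0.1, 2.0, 50)
xdot = -x / (1.0 + x)          # exact derivatives, no noise

# Augmented library [Theta_N(X), diag(xdot) Theta_D(X)] with Theta_N = Theta_D = [1, x]
Theta_N = np.column_stack([np.ones_like(x), x])
Theta_aug = np.column_stack([Theta_N, xdot[:, None] * Theta_N])   # columns: 1, x, xdot, x*xdot

# Numerical null space: right singular vector belonging to the smallest singular value
U, s, Vt = np.linalg.svd(Theta_aug, full_matrices=False)
xi = Vt[-1]
xi = xi / xi[1]                # fix the arbitrary scale so the coefficient of x is 1

# The null vector encodes x + xdot + x*xdot = 0, i.e. xdot = -x / (1 + x)
print(np.round(xi, 6))
```

Adding even small noise to `xdot` lifts the smallest singular value away from zero, which is the sensitivity described in the text.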

A recent technique by Kaheman et al. [364] circumvents this ill-conditioned
search through the null space of $\boldsymbol{\Theta}$. Instead, this approach picks a candidate
term from the library and moves it to the right-hand side, so that the regression 
problem no longer involves a null space. Candidate terms are tested until one 
is found that is actually in the model, after which it is possible to find a sparse 
model that is also accurate. 
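
The idea of moving one candidate term to the right-hand side can be sketched as follows. This is a simplified illustration, not the algorithm of Kaheman et al.: it reuses the assumed toy system $\dot{x} = -x/(1+x)$, uses plain least squares in place of sparse regression, and the helper `fit_with_candidate` is hypothetical.

```python
import numpy as np

# Same illustrative system as an assumption: dx/dt = -x / (1 + x), exact derivatives
x = np.linspace(0.1, 2.0, 50)
xdot = -x / (1.0 + x)

names = ["1", "x", "xdot", "x*xdot"]
Theta_aug = np.column_stack([np.ones_like(x), x, xdot, x * xdot])

def fit_with_candidate(Theta, j):
    """Move column j to the right-hand side and regress it onto the remaining columns."""
    others = [i for i in range(Theta.shape[1]) if i != j]
    coeffs = np.linalg.lstsq(Theta[:, others], Theta[:, j], rcond=None)[0]
    residual = np.linalg.norm(Theta[:, others] @ coeffs - Theta[:, j])
    return others, coeffs, residual

# Test every candidate term; terms that belong to the model give a near-zero residual
results = {names[j]: fit_with_candidate(Theta_aug, j) for j in range(len(names))}
for name, (_, coeffs, residual) in results.items():
    print(name, np.round(coeffs, 4), residual)
```

Here the candidate `x*xdot` is in the model, so its regression is exact (it equals `-x - xdot`), while the constant term is not and leaves a large residual. Each candidate is an ordinary regression problem, so no null-space computation is required.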


General Formulation for Implicit ODEs 


The optimization procedure above may be generalized to include a larger class
of implicit ordinary differential equations, in addition to those containing rational
function nonlinearities. The library $\boldsymbol{\Theta}(\mathbf{X}, \dot{x}_k(t))$ contains a subset of the
columns of the library $\boldsymbol{\Theta}([\mathbf{X}\ \ \dot{\mathbf{X}}])$, which is obtained by building nonlinear
functions of the state $\mathbf{x}$ and derivative $\dot{\mathbf{x}}$. Identifying the sparsest vector in
the null space of $\boldsymbol{\Theta}([\mathbf{X}\ \ \dot{\mathbf{X}}])$ provides more flexibility in identifying nonlinear
equations with mixed terms containing various powers of any combination
of derivatives and states. For example, the system given by


r? — tr- r =0 (7.59) 


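may be represented as a sparse vector in the null space of $\boldsymbol{\Theta}([\mathbf{X}\ \ \dot{\mathbf{X}}])$. This
formulation may be extended to include higher-order derivatives in the library
$\boldsymbol{\Theta}$, for example to identify second-order implicit differential equations:

$$\boldsymbol{\Theta}([\mathbf{X}\ \ \dot{\mathbf{X}}\ \ \ddot{\mathbf{X}}]). \tag{7.60}$$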


The generality of this approach enables the identification of many systems of 
interest, including those systems with rational function nonlinearities. 
Figure 7.10: Illustration of model selection using SINDy and information criteria,
as in Mangan et al. [479]. The most parsimonious model on the Pareto front
is chosen to minimize the AIC score (blue circle), preventing overfitting. 


Information Criteria for Model Selection 


When performing the sparse regression in the SINDy algorithm, the sparsity-promoting
parameter $\lambda$ is a free variable. In practice, different values of $\lambda$ will
result in different models with various levels of sparsity, ranging from the trivial
model $\dot{\mathbf{x}} = \mathbf{0}$ for very large $\lambda$ to the simple least-squares solution for $\lambda = 0$.
Thus, by varying $\lambda$, it is possible to sweep out a Pareto front, balancing error
versus complexity, as in Fig. 7.10. To identify the most parsimonious model,
with low error and a reasonable complexity, it is possible to leverage information
criteria for model selection, as described in Mangan et al. [479]. In particular,
if we compute the Akaike information criterion (AIC) [8, 9], which penalizes
the number of terms in the model, then the most parsimonious model minimizes
the AIC. This procedure has been applied to several sparse identification
problems, and in every case the true model was correctly identified [479].


7.4 Koopman Operator Theory 


Koopman operator theory has recently emerged as an alternative perspective 
for dynamical systems in terms of the evolution of measurements g(x). In 1931, 
Bernard O. Koopman demonstrated that it is possible to represent a nonlinear 
dynamical system in terms of an infinite-dimensional linear operator acting 
on a Hilbert space of measurement functions of the state of the system. This 
so-called Koopman operator is linear, and its spectral decomposition completely
characterizes the behavior of a nonlinear system, analogous to (7.7). However, it 
is also infinite-dimensional, as there are infinitely many degrees of freedom re- 
quired to describe the space of all possible measurement functions g of the state. 
This poses new challenges. Obtaining finite-dimensional, matrix approxima- 
tions of the Koopman operator is the focus of intense research efforts and holds 
the promise of enabling globally linear representations of nonlinear dynami- 
cal systems. Expressing nonlinear dynamics in a linear framework is appealing 
because of the wealth of optimal estimation and control techniques available 
for linear systems (see Chapter 8) and the ability to analytically predict the future
state of the system. Obtaining a finite-dimensional approximation of the
Koopman operator has been challenging in practice, as it involves identifying 
a subspace spanned by a subset of eigenfunctions of the Koopman operator. 
For a more complete discussion of modern Koopman theory and data-driven 
approximations, see [128]. 


Mathematical Formulation of Koopman Theory 


The Koopman operator advances measurement functions of the state with the 
flow of the dynamics. We consider real-valued measurement functions $g : \mathbf{M} \to \mathbb{R}$,
which are elements of an infinite-dimensional Hilbert space. The functions
g are also commonly known as observables, although this may be confused with 
the unrelated observability from control theory. Typically, the Hilbert space is 
given by the Lebesgue square-integrable functions on M; other choices of a 
measure space are also valid. 

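The Koopman operator $\mathcal{K}_t$ is an infinite-dimensional linear operator that
acts on measurement functions g as

$$\mathcal{K}_t g = g \circ \mathbf{F}_t, \tag{7.61}$$

where $\circ$ is the composition operator. For a discrete-time system with time-step
$\Delta t$, this becomes

$$\mathcal{K}_{\Delta t}\, g(\mathbf{x}_k) = g(\mathbf{F}_{\Delta t}(\mathbf{x}_k)) = g(\mathbf{x}_{k+1}). \tag{7.62}$$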
In other words, the Koopman operator defines an infinite-dimensional linear
dynamical system that advances the observation of the state $g_k = g(\mathbf{x}_k)$ to the
next time-step:

$$g(\mathbf{x}_{k+1}) = \mathcal{K}_{\Delta t}\, g(\mathbf{x}_k). \tag{7.63}$$

Note that this is true for any observable function g and for any state $\mathbf{x}_k$.
The Koopman operator is linear, a property that is inherited from the linearity
of the addition operation in function spaces:

$$\mathcal{K}_t(\alpha_1 g_1(\mathbf{x}) + \alpha_2 g_2(\mathbf{x})) = \alpha_1 g_1(\mathbf{F}_t(\mathbf{x})) + \alpha_2 g_2(\mathbf{F}_t(\mathbf{x})) \tag{7.64a}$$
$$= \alpha_1 \mathcal{K}_t g_1(\mathbf{x}) + \alpha_2 \mathcal{K}_t g_2(\mathbf{x}). \tag{7.64b}$$
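
The composition definition and the linearity property can be checked numerically on a small example. In the sketch below, the logistic map used for `F`, the observables `g1` and `g2`, and the helper `koopman` are all illustrative assumptions; the point is only that the operator acts linearly on observables even though the underlying map is nonlinear.

```python
import numpy as np

def F(x):
    """Assumed discrete-time map for illustration: the logistic map with r = 3.7."""
    return 3.7 * x * (1.0 - x)

def koopman(g):
    """Return the observable K g = g o F, i.e. g evaluated one time step ahead."""
    return lambda x: g(F(x))

g1 = np.sin
g2 = lambda x: x**2
a1, a2 = 2.0, -0.5

x = np.linspace(0.05, 0.95, 7)

# Left-hand side: K applied to the linear combination of observables
combo = lambda x: a1 * g1(x) + a2 * g2(x)
lhs = koopman(combo)(x)

# Right-hand side: the same linear combination of K g1 and K g2
rhs = a1 * koopman(g1)(x) + a2 * koopman(g2)(x)
```

The operator is represented here as a function acting on functions, which reflects why it is infinite-dimensional: there is no finite matrix unless the observables are restricted to a finite-dimensional invariant subspace.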


For sufficiently smooth dynamical systems, it is also possible to define the
continuous-time analogue of the Koopman dynamical system in (7.63):

$$\frac{d}{dt} g = \mathcal{K} g. \tag{7.65}$$

The operator $\mathcal{K}$ is the infinitesimal generator of the one-parameter family of
transformations $\mathcal{K}_t$ [4]. It is defined by its action on an observable function g:

$$\mathcal{K} g = \lim_{t \to 0} \frac{\mathcal{K}_t g - g}{t} = \lim_{t \to 0} \frac{g \circ \mathbf{F}_t - g}{t}. \tag{7.66}$$


The linear dynamical systems in (7.65) and (7.63) are analogous to the
continuous-time dynamical system and the discrete-time dynamical system in (7.4),
respectively. It is important to note that the original state $\mathbf{x}$ may be the
observable, and the infinite-dimensional operator $\mathcal{K}_t$
will advance this function. However, the simple representation of the observable
$g = \mathbf{x}$ in a chosen basis for Hilbert space may become arbitrarily complex
once iterated through the dynamics. In other words, finding a representation
for $\mathcal{K}_t \mathbf{x}$ may not be simple or straightforward.


Koopman Eigenfunctions and Intrinsic Coordinates 


The Koopman operator is linear, which is appealing, but is infinite-dimensional, 
posing issues for representation and computation. Instead of capturing the evo- 
lution of all measurement functions in a Hilbert space, applied Koopman anal- 
ysis attempts to identify key measurement functions that evolve linearly with 
the flow of the dynamics. Eigenfunctions of the Koopman operator provide just 
such a set of special measurements that behave linearly in time. In fact, a pri- 
mary motivation to adopt the Koopman framework is the ability to simplify 
the dynamics through the eigendecomposition of the operator. 

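A discrete-time Koopman eigenfunction $\varphi(\mathbf{x})$ corresponding to eigenvalue
$\lambda$ satisfies

$$\varphi(\mathbf{x}_{k+1}) = \mathcal{K}_{\Delta t}\varphi(\mathbf{x}_k) = \lambda\varphi(\mathbf{x}_k). \tag{7.67}$$

In continuous time, a Koopman eigenfunction $\varphi(\mathbf{x})$ satisfies

$$\frac{d}{dt}\varphi(\mathbf{x}) = \mathcal{K}\varphi(\mathbf{x}) = \lambda\varphi(\mathbf{x}). \tag{7.68}$$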


Obtaining Koopman eigenfunctions from data or from analytic expressions is
a central applied challenge in modern dynamical systems. Discovering these
eigenfunctions enables globally linear representations of strongly nonlinear
systems.

Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved.

Applying the chain rule to the time derivative of the Koopman eigenfunction
$\varphi(\mathbf{x})$ yields

$$\frac{d}{dt}\varphi(\mathbf{x}) = \nabla\varphi(\mathbf{x}) \cdot \dot{\mathbf{x}} = \nabla\varphi(\mathbf{x}) \cdot \mathbf{f}(\mathbf{x}). \tag{7.69}$$

Combined with (7.68), this results in a partial differential equation for the
eigenfunction $\varphi(\mathbf{x})$:

$$\nabla\varphi(\mathbf{x}) \cdot \mathbf{f}(\mathbf{x}) = \lambda\varphi(\mathbf{x}). \tag{7.70}$$

With this PDE, which is linear in $\varphi$ even when $\mathbf{f}$ is nonlinear, it is possible
to approximate the eigenfunctions, either by solving for the Laurent series or
with data via regression, both of which are explored below. This formulation
assumes that the dynamics are both continuous and differentiable. The
discrete-time dynamics in (7.4) are more general, although in many examples the
continuous-time dynamics have a simpler representation than the discrete-time
map for long times. For example, the Lorenz system has a simple continuous-time
representation, yet is generally unrepresentable for even moderately long
discrete-time updates.

The key takeaway from (7.67) and (7.68) is that the nonlinear dynamics
become completely linear in eigenfunction coordinates, given by $\varphi(\mathbf{x})$. As a
simple example, any conserved quantity of a dynamical system is a Koopman
eigenfunction corresponding to eigenvalue $\lambda = 0$. This establishes a Koopman
extension of the famous Noether's theorem [529], implying that any symmetry
in the governing equations gives rise to a new Koopman eigenfunction with
eigenvalue $\lambda = 0$. For example, the Hamiltonian energy function is a Koopman
eigenfunction for a conservative system. In addition, the constant function
$\varphi = 1$ is always a trivial eigenfunction corresponding to $\lambda = 0$ for every
dynamical system.
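
The statement that a conserved quantity satisfies (7.70) with $\lambda = 0$ can be
checked numerically. The sketch below assumes, purely for illustration, the
undamped Duffing oscillator $\dot{x}_1 = x_2$, $\dot{x}_2 = x_1 - x_1^3$ with
Hamiltonian $H = \tfrac12 x_2^2 - \tfrac12 x_1^2 + \tfrac14 x_1^4$; the helper
names `grad` and `pde_residual`, the finite-difference step, and the test points
are hypothetical choices.

```python
def grad(phi, x, h=1e-6):
    """Central-difference gradient of a scalar function phi at the point x (a list)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((phi(xp) - phi(xm)) / (2 * h))
    return g

def pde_residual(phi, f, x, lam):
    """Residual of the eigenfunction PDE (7.70): grad(phi) . f - lam * phi."""
    return sum(gi * fi for gi, fi in zip(grad(phi, x), f(x))) - lam * phi(x)

# Assumed example system: undamped Duffing oscillator.
def f(x):
    return [x[1], x[0] - x[0] ** 3]

# Its Hamiltonian, a conserved quantity.
def H(x):
    return 0.5 * x[1] ** 2 - 0.5 * x[0] ** 2 + 0.25 * x[0] ** 4

# Arbitrary test points in the state space.
points = [[0.3, -0.7], [1.2, 0.4], [-0.8, 1.5]]
residuals = [pde_residual(H, f, p, 0.0) for p in points]
print(residuals)
```

The residuals vanish up to finite-difference error, consistent with $H$ being
an eigenfunction with zero eigenvalue.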


Eigenvalue Lattices. Interestingly, a set of Koopman eigenfunctions may be
used to generate more eigenfunctions. In discrete time, we find that the product
of two eigenfunctions $\varphi_1(\mathbf{x})$ and $\varphi_2(\mathbf{x})$ is also an eigenfunction,

$$\begin{aligned}
\mathcal{K}_t(\varphi_1(\mathbf{x})\varphi_2(\mathbf{x})) &= \varphi_1(\mathbf{F}_t(\mathbf{x}))\,\varphi_2(\mathbf{F}_t(\mathbf{x})) && (7.71\text{a})\\
&= \lambda_1\lambda_2\,\varphi_1(\mathbf{x})\varphi_2(\mathbf{x}), && (7.71\text{b})
\end{aligned}$$

corresponding to a new eigenvalue $\lambda_1\lambda_2$ given by the product of the two
eigenvalues of $\varphi_1(\mathbf{x})$ and $\varphi_2(\mathbf{x})$. In continuous time, the relationship becomes

$$\begin{aligned}
\mathcal{K}(\varphi_1\varphi_2) &= \frac{d}{dt}(\varphi_1\varphi_2) && (7.72\text{a})\\
&= \dot{\varphi}_1\varphi_2 + \varphi_1\dot{\varphi}_2 && (7.72\text{b})\\
&= \lambda_1\varphi_1\varphi_2 + \lambda_2\varphi_1\varphi_2 && (7.72\text{c})\\
&= (\lambda_1 + \lambda_2)\,\varphi_1\varphi_2. && (7.72\text{d})
\end{aligned}$$

Interestingly, this means that the set of Koopman eigenfunctions establishes
a commutative monoid under pointwise multiplication; a monoid has the structure
of a group, except that the elements need not have inverses. Thus, depending
on the dynamical system, there may be a finite set of generator eigenfunction
elements that may be used to construct all other eigenfunctions. The
corresponding eigenvalues similarly form a lattice, based on the product $\lambda_1\lambda_2$ or
sum $\lambda_1 + \lambda_2$, depending on whether the dynamics are in discrete time or
continuous time. For example, given a linear system $\dot{x} = \lambda x$, then $\varphi(x) = x$ is an
eigenfunction with eigenvalue $\lambda$. Moreover, $\varphi^{\alpha} = x^{\alpha}$ is also an eigenfunction
with eigenvalue $\alpha\lambda$ for any $\alpha$.

The continuous-time and discrete-time lattices are related in a simple way. If
the continuous-time eigenvalues are given by $\lambda$, then the corresponding
discrete-time eigenvalues are given by $e^{\lambda t}$. Thus, the eigenvalue expressions
in (7.71b) and (7.72d) are related as

$$e^{\lambda_1 t} e^{\lambda_2 t}\,\varphi_1(\mathbf{x})\varphi_2(\mathbf{x}) = e^{(\lambda_1+\lambda_2)t}\,\varphi_1(\mathbf{x})\varphi_2(\mathbf{x}). \tag{7.73}$$

As another simple demonstration of the relationship between continuous-time
and discrete-time eigenvalues, consider the continuous-time definition in
(7.66) applied to an eigenfunction:

$$\lim_{t\to 0}\frac{\mathcal{K}_t\varphi(\mathbf{x}) - \varphi(\mathbf{x})}{t} = \lim_{t\to 0}\frac{e^{\lambda t}\varphi(\mathbf{x}) - \varphi(\mathbf{x})}{t} = \lambda\varphi(\mathbf{x}). \tag{7.74}$$
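
The lattice structure can be checked directly for the linear example
$\dot{x} = \lambda x$ mentioned above. In this sketch the numerical values of
`lam`, `t`, and `x0` are arbitrary assumptions; the discrete-time eigenvalue of
each candidate eigenfunction is read off as the ratio $\varphi(F_t(x))/\varphi(x)$.

```python
import math

lam, t = -0.4, 0.25   # assumed continuous-time eigenvalue and time-step

def F(x):
    """Flow map of dx/dt = lam * x over a time t."""
    return math.exp(lam * t) * x

def phi(alpha):
    """Candidate eigenfunction phi_alpha(x) = x**alpha."""
    return lambda x: x ** alpha

x0 = 1.7
ratios = {}
for alpha in (1, 2, 3):
    # Discrete-time eigenvalue observed from one step: phi(F(x)) / phi(x).
    ratios[alpha] = phi(alpha)(F(x0)) / phi(alpha)(x0)

# The product of two eigenfunctions, phi_1 * phi_2, as in (7.71).
prod = lambda x: phi(1)(x) * phi(2)(x)
prod_ratio = prod(F(x0)) / prod(x0)

for alpha, r in ratios.items():
    print(alpha, r, math.exp(alpha * lam * t))
print(prod_ratio, ratios[1] * ratios[2])
```

Each ratio equals $e^{\alpha\lambda t}$, and the product eigenfunction has the
product eigenvalue, in line with (7.73).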


Koopman Mode Decomposition and Finite Representations 


Until now, we have considered scalar measurements of a system, and we un- 
covered special eigen-measurements that evolve linearly in time. However, we 
often take multiple measurements of a system. In extreme cases, we may mea- 
sure the entire state of a high-dimensional spatial system, such as an evolving 
fluid flow. These measurements may then be arranged in a vector g: 


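$$\mathbf{g}(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_p(\mathbf{x}) \end{bmatrix}. \tag{7.75}$$

Each of the individual measurements may be expanded in terms of the
eigenfunctions $\varphi_j(\mathbf{x})$, which provide a basis for Hilbert space:

$$g_i(\mathbf{x}) = \sum_{j=0}^{\infty} v_{ij}\,\varphi_j(\mathbf{x}). \tag{7.76}$$

Thus, the vector of observables, $\mathbf{g}$, may be similarly expanded:

$$\mathbf{g}(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_p(\mathbf{x}) \end{bmatrix} = \sum_{j=0}^{\infty} \varphi_j(\mathbf{x})\,\mathbf{v}_j, \tag{7.77}$$

where $\mathbf{v}_j$ is the $j$th Koopman mode associated with the eigenfunction $\varphi_j$.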

For conservative dynamical systems, such as those governed by Hamiltonian
dynamics, the Koopman operator is unitary on the Hilbert space of
square-integrable functions. Thus, the Koopman eigenfunctions are orthonormal for
conservative systems, and it is possible to compute the Koopman modes $\mathbf{v}_j$
directly by projection:

$$\mathbf{v}_j = \begin{bmatrix} \langle \varphi_j, g_1 \rangle \\ \langle \varphi_j, g_2 \rangle \\ \vdots \\ \langle \varphi_j, g_p \rangle \end{bmatrix}, \tag{7.78}$$

where $\langle \cdot, \cdot \rangle$ is the standard inner product of functions in Hilbert space. Thus, the
expansion of the observable function in (7.77) may be thought of as a change of
basis into eigenfunction coordinates. These Koopman modes have a physical
interpretation in the case of direct spatial measurements of a system, $\mathbf{g}(\mathbf{x}) = \mathbf{x}$,
in which case the modes are coherent spatial modes that behave linearly with
the same temporal dynamics (i.e., oscillations, possibly with exponential growth
or decay). These Koopman modes $\mathbf{v}_j$ are also known as dynamic modes in DMD.

Given the decomposition in (7.77), it is possible to represent the dynamics
of the measurements $\mathbf{g}$ as follows:

$$\begin{aligned}
\mathbf{g}(\mathbf{x}_k) = \mathcal{K}_{\Delta t}^{k}\,\mathbf{g}(\mathbf{x}_0) &= \mathcal{K}_{\Delta t}^{k} \sum_{j=0}^{\infty} \varphi_j(\mathbf{x}_0)\,\mathbf{v}_j && (7.79\text{a})\\
&= \sum_{j=0}^{\infty} \mathcal{K}_{\Delta t}^{k}\,\varphi_j(\mathbf{x}_0)\,\mathbf{v}_j && (7.79\text{b})\\
&= \sum_{j=0}^{\infty} \lambda_j^{k}\,\varphi_j(\mathbf{x}_0)\,\mathbf{v}_j, && (7.79\text{c})
\end{aligned}$$

where $\mathcal{K}_{\Delta t}^{k}$ is the Koopman operator $\mathcal{K}_{\Delta t}$ applied $k$ times. This sequence of
triples, $\{(\lambda_j, \varphi_j, \mathbf{v}_j)\}_{j=0}^{\infty}$, is known as the Koopman mode decomposition, and was
introduced by Mezić in 2005 [497]. Often, it is possible to approximate this
expansion as a truncated sum of only a few dominant terms. The Koopman
mode decomposition was later connected to data-driven regression via the
dynamic mode decomposition [611], which was discussed earlier in this chapter. The
DMD eigenvalues approximate the Koopman eigenvalues $\lambda_j$, the DMD modes
approximate the Koopman modes $\mathbf{v}_j$, and the DMD mode amplitudes approximate
the corresponding Koopman eigenfunctions evaluated at the initial condition,
$\varphi_j(\mathbf{x}_0)$. In fact, the Koopman mode decomposition in (7.79) is nearly identical
to the DMD spectral expansion in (7.28), with the DMD mode amplitudes $b_j$
replaced with the Koopman eigenfunctions $\varphi_j(\mathbf{x}_0)$ evaluated at the initial
condition, and the DMD modes $\boldsymbol{\phi}_j$ replaced with the Koopman modes $\mathbf{v}_j$. It is
important to note that the Koopman modes and eigenfunctions are distinct
mathematical objects, requiring different approaches for approximation. Koopman
eigenfunctions are often more challenging to compute than Koopman modes,
motivating advanced techniques, such as the extended DMD algorithm
in Section 7.5.
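
When the observables span a finite Koopman-invariant subspace, the sum in (7.79)
has finitely many terms and can be evaluated exactly. The sketch below assumes a
hypothetical discrete-time map `F` (with made-up parameters `a` and `b`) whose
observables $(x_1, x_2, x_1^2)$ close under the dynamics; the Koopman modes are
taken as right eigenvectors of the matrix `K`, and the eigenfunction values at
$\mathbf{x}_0$ come from the corresponding left eigenvectors.

```python
import numpy as np

a, b = 0.9, 0.5  # hypothetical discrete-time parameters

def F(x):
    """Assumed nonlinear map with a closed embedding in (x1, x2, x1^2)."""
    x1, x2 = x
    return np.array([a * x1, b * x2 + (a**2 - b) * x1**2])

def g(x):
    """Vector of observables g(x) = (x1, x2, x1^2)."""
    return np.array([x[0], x[1], x[0] ** 2])

# Matrix of the Koopman operator restricted to span{x1, x2, x1^2}: g(F(x)) = K g(x).
K = np.array([[a, 0.0, 0.0],
              [0.0, b, a**2 - b],
              [0.0, 0.0, a**2]])

lams, V = np.linalg.eig(K)   # columns of V play the role of the modes v_j
W = np.linalg.inv(V)         # rows of W are the matching left eigenvectors
x0 = np.array([1.0, -0.5])
phi0 = W @ g(x0)             # eigenfunction values phi_j(x0)

def kmd_predict(k):
    """Evaluate sum_j lam_j^k phi_j(x0) v_j, the finite analogue of (7.79c)."""
    return V @ (lams**k * phi0)

# Compare against direct iteration of the nonlinear map.
x = x0.copy()
for _ in range(10):
    x = F(x)
kmd_error = np.max(np.abs(kmd_predict(10) - g(x)))
print(kmd_error)
```

For systems without such a closed subspace the sum must be truncated, and the
agreement is only approximate.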


Invariant Eigenspaces and Finite-Dimensional Models 


Instead of capturing the evolution of all measurement functions in a Hilbert 
space, applied Koopman analysis approximates the evolution on an invariant 
subspace spanned by a finite set of measurement functions. 

A Koopman-invariant subspace is defined as the span of a set of functions
$\{g_1, g_2, \ldots, g_p\}$ if all functions $g$ in this subspace,

$$g = \alpha_1 g_1 + \alpha_2 g_2 + \cdots + \alpha_p g_p, \tag{7.80}$$

remain in this subspace after being acted on by the Koopman operator $\mathcal{K}$:

$$\mathcal{K} g = \beta_1 g_1 + \beta_2 g_2 + \cdots + \beta_p g_p. \tag{7.81}$$

It is possible to obtain a finite-dimensional matrix representation of the
Koopman operator by restricting it to an invariant subspace spanned by a finite
number of functions $\{g_j\}_{j=1}^{p}$; this is illustrated in Fig. 7.11. The matrix
representation $\mathbf{K}$ acts on a vector space $\mathbb{R}^p$, with the coordinates given by the
values of $g_j(\mathbf{x})$. This induces a finite-dimensional linear system, as in (7.63)
and (7.65).

Any finite set of eigenfunctions of the Koopman operator will span an
invariant subspace. Discovering these eigenfunction coordinates is, therefore, a
central challenge, as they provide intrinsic coordinates along which the
dynamics behave linearly. In practice, it is more likely that we will identify an
approximately invariant subspace, given by a set of functions $\{g_j\}_{j=1}^{p}$, where
each of the functions $g_j$ is well approximated by a finite sum of eigenfunctions:

$$g_j \approx \sum_{k=0}^{p} \alpha_k \varphi_k.$$

[Figure 7.11 diagram: $\mathbf{F}_t: \mathbf{x}_k \mapsto \mathbf{x}_{k+1}$; $g: \mathbf{x}_k \mapsto \mathbf{y}_k$; $\mathbf{K}_t: \mathbf{y}_k \mapsto \mathbf{y}_{k+1}$.]

Figure 7.11: Schematic illustrating the Koopman operator for nonlinear
dynamical systems. The dashed lines from $\mathbf{y}_k \to \mathbf{x}_k$ indicate that we would like to be
able to recover the original state.


Examples of Koopman Embeddings 


Nonlinear System with Single Fixed Point and a Slow Manifold 


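Here, we consider an example system with a single fixed point, given by

$$\begin{aligned}
\dot{x}_1 &= \mu x_1, && (7.82\text{a})\\
\dot{x}_2 &= \lambda (x_2 - x_1^2). && (7.82\text{b})
\end{aligned}$$

For $\lambda < \mu < 0$, the system exhibits a slow attracting manifold given by $x_2 = x_1^2$.
It is possible to augment the state $\mathbf{x}$ with the nonlinear measurement $g = x_1^2$, to
define a three-dimensional Koopman-invariant subspace. In these coordinates,
the dynamics become linear:

$$\frac{d}{dt}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} \mu & 0 & 0 \\ 0 & \lambda & -\lambda \\ 0 & 0 & 2\mu \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} \quad \text{for} \quad \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \end{bmatrix}. \tag{7.83}$$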
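
The equivalence between the nonlinear system (7.82) and the linear system (7.83)
can be checked by integrating both from matching initial conditions. The sketch
below uses a hand-written fourth-order Runge-Kutta integrator; the step size,
horizon, and initial condition are arbitrary assumptions, and the parameter
values $\mu = -0.05$, $\lambda = -1$ are chosen to satisfy $\lambda < \mu < 0$.

```python
mu, lam = -0.05, -1.0

def f(x):
    """Nonlinear dynamics (7.82)."""
    return [mu * x[0], lam * (x[1] - x[0] ** 2)]

def koopman_rhs(y):
    """Linear dynamics (7.83) on y = (x1, x2, x1^2)."""
    return [mu * y[0], lam * y[1] - lam * y[2], 2 * mu * y[2]]

def rk4(rhs, z, dt, steps):
    """Integrate dz/dt = rhs(z) with classical fourth-order Runge-Kutta steps."""
    for _ in range(steps):
        k1 = rhs(z)
        k2 = rhs([zi + 0.5 * dt * ki for zi, ki in zip(z, k1)])
        k3 = rhs([zi + 0.5 * dt * ki for zi, ki in zip(z, k2)])
        k4 = rhs([zi + dt * ki for zi, ki in zip(z, k3)])
        z = [zi + dt / 6 * (a + 2 * b + 2 * c + d)
             for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]
    return z

x0 = [1.5, -2.0]                      # arbitrary initial state
y0 = [x0[0], x0[1], x0[0] ** 2]       # lifted initial condition on y3 = y1^2
x_T = rk4(f, x0, 0.01, 1000)
y_T = rk4(koopman_rhs, y0, 0.01, 1000)

# The first two lifted coordinates track the state; the third tracks x1^2.
gap = max(abs(x_T[0] - y_T[0]), abs(x_T[1] - y_T[1]), abs(x_T[0] ** 2 - y_T[2]))
print(gap)
```

The remaining gap is due only to integration error, since the lifted trajectory
stays on the constraint $y_3 = y_1^2$.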


The full three-dimensional Koopman observable vector space is visualized
in Fig. 7.12. Trajectories that start on the invariant manifold $y_3 = y_1^2$, visualized
by the blue parabolic surface, are constrained to stay on this manifold. There is
a slow subspace, spanned by the eigenvectors corresponding to the slow
eigenvalues $\mu$ and $2\mu$; this subspace is visualized by the green planar surface. Finally,
there is the original asymptotically attracting manifold of the original system,
$y_2 = y_1^2$, which is visualized as the red parabolic surface. The blue and red
parabolic surfaces always intersect in a parabola that is inclined at a 45° angle
in the $y_2$-$y_3$ direction. The green surface approaches this 45° inclination as
the ratio of fast to slow dynamics becomes increasingly large. In the full
three-dimensional Koopman observable space, the dynamics produce a single stable
node, with trajectories rapidly attracting onto the green subspace and then
slowly approaching the fixed point.

Figure 7.12: Visualization of three-dimensional linear Koopman system from
(7.83) along with projection of dynamics onto the $x_1$-$x_2$ plane. The attracting
slow manifold is shown in red, the constraint $y_3 = y_1^2$ is shown in blue, and
the slow subspace of (7.83) is shown in green. Black trajectories of
the linear Koopman system in $\mathbf{y}$ project onto trajectories of the full nonlinear
system in $\mathbf{x}$ in the $y_1$-$y_2$ plane. Here, $\mu = -0.05$ and $\lambda = -1$. Reproduced from
Brunton et al. [127].


Intrinsic Coordinates Defined by Eigenfunctions of the Koopman Operator.
The left eigenvectors of the Koopman operator yield Koopman eigenfunctions
(i.e., eigen-observables). The Koopman eigenfunctions of (7.83) corresponding
to eigenvalues $\mu$ and $\lambda$ are

$$\varphi_\mu = x_1 \quad \text{and} \quad \varphi_\lambda = x_2 - b x_1^2 \quad \text{with} \quad b = \frac{\lambda}{\lambda - 2\mu}. \tag{7.84}$$

The constant $b$ in $\varphi_\lambda$ captures the fact that for a finite ratio $\lambda/\mu$, the dynamics
only shadow the asymptotically attracting slow manifold $x_2 = x_1^2$, but in fact
follow neighboring parabolic trajectories. This is illustrated more clearly by the
various surfaces in Fig. 7.12 for different ratios $\lambda/\mu$.

In this way, a set of intrinsic coordinates may be determined from the
observable functions defined by the left eigenvectors of the Koopman operator on
an invariant subspace. Explicitly,

$$\varphi_\alpha(\mathbf{x}) = \boldsymbol{\xi}_\alpha \mathbf{y}(\mathbf{x}), \quad \text{where} \quad \boldsymbol{\xi}_\alpha \mathbf{K} = \alpha \boldsymbol{\xi}_\alpha. \tag{7.85}$$

These eigen-observables define observable subspaces that remain invariant
under the Koopman operator, even after coordinate transformations. As such,
they may be regarded as intrinsic coordinates [757] on the Koopman-invariant
subspace.
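
The relation (7.85) suggests a direct numerical check: compute the left
eigenvector of the matrix in (7.83) for the eigenvalue $\lambda$, and compare it with
the coefficients of $\varphi_\lambda$ in (7.84). The following is a minimal sketch using
NumPy; the parameter values and the test point are assumptions.

```python
import numpy as np

mu, lam = -0.05, -1.0
K = np.array([[mu, 0.0, 0.0],
              [0.0, lam, -lam],
              [0.0, 0.0, 2 * mu]])

# Left eigenvectors of K are right eigenvectors of K^T.
evals, evecs = np.linalg.eig(K.T)
idx = int(np.argmin(np.abs(evals - lam)))
xi = np.real(evecs[:, idx] / evecs[1, idx])   # scale so the x2 coefficient is 1

b = lam / (lam - 2 * mu)                      # constant from (7.84)

def y(x):
    """Lifted coordinates y = (x1, x2, x1^2)."""
    return np.array([x[0], x[1], x[0] ** 2])

def f(x):
    """Nonlinear dynamics (7.82)."""
    return np.array([mu * x[0], lam * (x[1] - x[0] ** 2)])

def phi(x):
    """Eigen-observable phi(x) = xi . y(x), as in (7.85)."""
    return xi @ y(x)

def grad_phi(x):
    """Analytic gradient of xi . (x1, x2, x1^2) with respect to (x1, x2)."""
    return np.array([xi[0] + 2 * xi[2] * x[0], xi[1]])

x = np.array([0.7, -1.1])                     # arbitrary test point
residual = grad_phi(x) @ f(x) - lam * phi(x)  # residual of the PDE (7.70)
print(xi, b, residual)
```

The recovered coefficients are $(0, 1, -b)$, and the eigenfunction PDE (7.70) is
satisfied at the test point.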


Example of Intractable Representation 


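Consider the logistic map, given by

$$x_{k+1} = \beta x_k (1 - x_k). \tag{7.86}$$

Let our observable subspace include $x$ and $x^2$:

$$\mathbf{y}_k = \begin{bmatrix} x \\ x^2 \end{bmatrix}_k = \begin{bmatrix} x_k \\ x_k^2 \end{bmatrix}. \tag{7.87}$$

Writing out the Koopman operator, the first row equation is simple:

$$\mathbf{y}_{k+1} = \begin{bmatrix} x \\ x^2 \end{bmatrix}_{k+1} = \begin{bmatrix} \beta & -\beta \\ ? & ? \end{bmatrix} \begin{bmatrix} x \\ x^2 \end{bmatrix}_k, \tag{7.88}$$

but the second row is not obvious. To find this expression, expand $x_{k+1}^2$:

$$x_{k+1}^2 = (\beta x_k (1 - x_k))^2 = \beta^2 (x_k^2 - 2x_k^3 + x_k^4). \tag{7.89}$$

Thus, cubic and quartic polynomial terms are required to advance $x^2$. Similarly,
these terms need polynomials up to sixth and eighth order, respectively, and so
on, ad infinitum: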


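$$\begin{bmatrix} x \\ x^2 \\ x^3 \\ x^4 \\ x^5 \\ \vdots \end{bmatrix}_{k+1} =
\begin{bmatrix}
\beta & -\beta & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
0 & \beta^2 & -2\beta^2 & \beta^2 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\
0 & 0 & \beta^3 & -3\beta^3 & 3\beta^3 & -\beta^3 & 0 & 0 & 0 & 0 & \cdots \\
0 & 0 & 0 & \beta^4 & -4\beta^4 & 6\beta^4 & -4\beta^4 & \beta^4 & 0 & 0 & \cdots \\
0 & 0 & 0 & 0 & \beta^5 & -5\beta^5 & 10\beta^5 & -10\beta^5 & 5\beta^5 & -\beta^5 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} x \\ x^2 \\ x^3 \\ x^4 \\ x^5 \\ x^6 \\ x^7 \\ x^8 \\ x^9 \\ x^{10} \\ \vdots \end{bmatrix}_k$$

It is interesting to note that the rows of this equation are related to the rows of
Pascal's triangle, with the $n$th row scaled by $\beta^n$, and with the omission of the
first row:

$$[x^0]_{k+1} = [0]\,[x^0]_k. \tag{7.90}$$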


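The above representation of the Koopman operator in a polynomial basis is
somewhat troubling. Not only is there no closure, but the determinant of any
finite-rank truncation is very large for $\beta > 1$. This illustrates a pitfall associated
with naive representation of the infinite-dimensional Koopman operator for a
simple chaotic system. Truncating the system, or performing a least-squares fit
on an augmented observable vector (i.e., DMD on a nonlinear measurement;
see Section 7.5), yields poor results, with the truncated system only agreeing
with the true dynamics for a small handful of iterations, as the complexity of
the representation grows quickly:

$$\begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ x^4 \\ x^5 \\ x^6 \\ x^7 \\ x^8 \\ \vdots \end{bmatrix}: \quad
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}
\overset{\mathbf{K}}{\Longrightarrow}
\begin{bmatrix} 0 \\ \beta \\ -\beta \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}
\overset{\mathbf{K}}{\Longrightarrow}
\begin{bmatrix} 0 \\ \beta^2 \\ -\beta^2 - \beta^3 \\ 2\beta^3 \\ -\beta^3 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}
\overset{\mathbf{K}}{\Longrightarrow}
\begin{bmatrix} 0 \\ \beta^3 \\ -\beta^3 - \beta^4 - \beta^5 \\ 2\beta^4 + 2\beta^5 + 2\beta^6 \\ -\beta^4 - \beta^5 - 6\beta^6 - \beta^7 \\ 6\beta^6 + 4\beta^7 \\ -2\beta^6 - 6\beta^7 \\ 4\beta^7 \\ -\beta^7 \\ \vdots \end{bmatrix}$$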
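
The growth in polynomial degree can be reproduced by composing the logistic map
with itself symbolically. In the sketch below, polynomials are stored as lists of
coefficients in increasing powers of $x$; the helper names `poly_mul` and
`poly_compose` and the integer choice $\beta = 2$ are assumptions made so that the
arithmetic is exact.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def poly_compose(p, q):
    """Coefficients of p(q(x)), evaluated with Horner's rule."""
    out = [p[-1]]
    for c in reversed(p[:-1]):
        out = poly_mul(out, q)
        out[0] += c
    return out

beta = 2                          # assumed integer parameter for exact arithmetic
logistic = [0, beta, -beta]       # beta*x - beta*x^2

g = [0, 1]                        # observable g(x) = x
degrees = []
for _ in range(3):
    g = poly_compose(g, logistic)  # one application of the Koopman operator
    degrees.append(len(g) - 1)

print(degrees)
print(g)
```

With $\beta = 2$, the final coefficient list agrees with the last column of the
display above, for example $6\beta^6 + 4\beta^7 = 896$ for the $x^5$ term, and the
degree doubles with every application of the operator.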


Analytic Series Expansions for Eigenfunctions 


Given the dynamics in (7.1), it is possible to solve the PDE in (7.70) using stan- 
dard techniques, such as recursively solving for the terms in a Taylor or Laurent 
series. A number of simple examples are explored below. 


Linear Dynamics 


Consider the simple linear dynamics 


$$\frac{d}{dt} x = x. \tag{7.92}$$

Assuming a Taylor series expansion for $\varphi(x)$:

$$\varphi(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + \cdots,$$

then the gradient and directional derivatives are given by

$$\begin{aligned}
\nabla\varphi &= c_1 + 2c_2 x + 3c_3 x^2 + 4c_4 x^3 + \cdots,\\
\nabla\varphi \cdot f &= c_1 x + 2c_2 x^2 + 3c_3 x^3 + 4c_4 x^4 + \cdots.
\end{aligned}$$

Solving for terms in the Koopman eigenfunction PDE (7.70), we see that $c_0 = 0$
must hold. For any positive integer $\lambda$ in (7.70), only one of the coefficients may
be non-zero. Specifically, for $\lambda = k \in \mathbb{Z}^+$, then $\varphi(x) = c x^k$ is an eigenfunction
for any constant $c$. For instance, if $\lambda = 1$, then $\varphi(x) = x$.


Quadratic Nonlinear Dynamics 


Consider a nonlinear dynamical system

$$\frac{d}{dt} x = x^2. \tag{7.93}$$

There is no Taylor series that satisfies (7.70), except the trivial solution $\varphi = 0$
for $\lambda = 0$. Instead, we assume a Laurent series:

$$\varphi(x) = \cdots + c_{-3} x^{-3} + c_{-2} x^{-2} + c_{-1} x^{-1} + c_0 + c_1 x + c_2 x^2 + c_3 x^3 + \cdots.$$

The gradient and directional derivatives are given by

$$\begin{aligned}
\nabla\varphi &= \cdots - 3c_{-3} x^{-4} - 2c_{-2} x^{-3} - c_{-1} x^{-2} + c_1 + 2c_2 x + 3c_3 x^2 + 4c_4 x^3 + \cdots,\\
\nabla\varphi \cdot f &= \cdots - 3c_{-3} x^{-2} - 2c_{-2} x^{-1} - c_{-1} + c_1 x^2 + 2c_2 x^3 + 3c_3 x^4 + 4c_4 x^5 + \cdots.
\end{aligned}$$

Solving for the coefficients of the Laurent series that satisfy (7.70), we find that
all coefficients with positive index are zero, i.e., $c_k = 0$ for all $k \geq 1$. However,
the non-positive index coefficients are given by the recursion $\lambda c_{k+1} = k c_k$, for
negative $k \leq -1$. Thus, the Laurent series is

$$\varphi(x) = c_0 \left( 1 - \lambda x^{-1} + \frac{\lambda^2}{2} x^{-2} - \frac{\lambda^3}{3!} x^{-3} + \cdots \right) = c_0 e^{-\lambda/x}.$$

This holds for all values of $\lambda \in \mathbb{C}$. There are also other Koopman eigenfunctions
that can be identified from the Laurent series.
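
The recursion $\lambda c_{k+1} = k c_k$ can be run numerically and compared with the
closed form $c_0 e^{-\lambda/x}$. The sketch below fixes $c_0 = 1$ and uses an
arbitrary eigenvalue and evaluation point; the truncation at 29 negative-index
terms is also an assumption.

```python
import math

lam = 0.7                 # arbitrary eigenvalue
c = {0: 1.0}              # normalise c_0 = 1
for k in range(-1, -30, -1):
    # Recursion lam * c_{k+1} = k * c_k, solved for c_k.
    c[k] = lam * c[k + 1] / k

def phi_series(x):
    """Truncated Laurent series sum_k c_k x^k over the non-positive indices."""
    return sum(ck * x ** k for k, ck in c.items())

x = 1.3
series_val = phi_series(x)
closed_val = math.exp(-lam / x)
print(series_val, closed_val)
```

The first coefficients come out as $c_{-1} = -\lambda$ and $c_{-2} = \lambda^2/2$,
matching the expansion above.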


Polynomial Nonlinear Dynamics 


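For a more general nonlinear dynamical system

$$\frac{d}{dt} x = a x^n, \tag{7.94}$$

with $n \neq 1$, we have that

$$\varphi(x) = e^{\frac{\lambda}{(1-n)a} x^{1-n}}$$

is an eigenfunction for all $\lambda \in \mathbb{C}$; substituting into (7.70) gives
$\nabla\varphi \cdot f = \frac{\lambda}{a} x^{-n}\varphi \cdot a x^n = \lambda\varphi$, and the case $a = 1$, $n = 2$ recovers $e^{-\lambda/x}$.

As mentioned above, it is also possible to generate new eigenfunctions by
taking powers of these primitive eigenfunctions; the resulting eigenvalues
generate a lattice in the complex plane.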


History and Recent Developments 


The original analysis of Koopman in 1931 was introduced to describe the evolu- 
tion of measurements of Hamiltonian systems [402], and this theory was gener- 
alized by Koopman and von Neumann to systems with continuous eigenvalue 
spectrum in 1932 [403]. In the case of Hamiltonian flows, the Koopman opera- 
tor K; is unitary, and forms a one-parameter family of unitary transformations 
in Hilbert space. Unitary operators should be familiar by now, as the discrete 
Fourier transform (DFT) and the singular value decomposition (SVD) both 
provide unitary coordinate transformations. Unitarity implies that the inner 
product of any two observable functions remains unchanged through action of 
the Koopman operator, which is intuitively related to the phase-space volume- 
preserving property of Hamiltonian systems. In the original paper [402], Koop- 
man drew connections between the Koopman eigenvalue spectrum and con- 
served quantities, integrability, and ergodicity. Interestingly, Koopman’s 1931 
paper was central in the celebrated proofs of the ergodic theorem by Birkhoff
and von Neumann [88, 89, 510, 521].

Koopman analysis has recently gained renewed interest with the pioneering
work of Mezić and collaborators [500]. The Koopman operator is also known as
the composition operator, which is formally the pull-back operator on the
space of scalar observable functions [4], and it is the dual, or left-adjoint, of
the Perron–Frobenius operator, or transfer operator, which is the push-forward
operator on the space of probability density functions. When a polynomial basis
is chosen to represent the Koopman operator, then it is closely related to
Carleman linearization [165], which has been used extensively in nonlinear
control [51, 408, 674, 686]. Koopman analysis is also connected to the resolvent
operator theory from fluid dynamics [655].

Recently, it has been shown that the operator-theoretic framework
complements the traditional geometric and probabilistic perspectives. For example,
level sets of Koopman eigenfunctions form invariant partitions of the state
space of a dynamical system [139]; in particular, eigenfunctions of the Koopman
operator may be used to analyze the ergodic partition [138, 501]. Koopman
analysis has also been recently shown to generalize the Hartman–Grobman
theorem to the entire basin of attraction of a stable or unstable equilibrium point
or periodic orbit [427].

At the time of this writing, representing Koopman eigenfunctions for gen- 
eral dynamical systems remains a central unsolved challenge. Significant re- 
search efforts are focused on developing data-driven techniques to identify 
Koopman eigenfunctions and use these for control, which will be discussed in 
the following sections and chapters. Recently, new work has emerged that attempts
to leverage the power of deep learning to discover and represent eigenfunctions
from data [465, 766].


7.5 Data-Driven Koopman Analysis 


Obtaining linear representations for strongly nonlinear systems has the po- 
tential to revolutionize our ability to predict and control these systems. The 
linearization of dynamics near fixed points or periodic orbits has long been 
employed for local linear representation of the dynamics [834]. The Koopman 
operator is appealing because it provides a global linear representation, valid 
far away from fixed points and periodic orbits. However, previous attempts to 
obtain finite-dimensional approximations of the Koopman operator have had 
limited success. Dynamic mode decomposition seeks to approxi- 
mate the Koopman operator with a best-fit linear model advancing spatial mea- 
surements from one time to the next, although these linear measurements are 
not rich enough for many nonlinear systems. Augmenting DMD with nonlin- 
ear measurements may enrich the model, but there is no guarantee that the 
resulting models will be closed under the Koopman operator [127]. Here, we 
describe several approaches for identifying Koopman embeddings and eigen- 
functions from data. These methods include the extended dynamic mode de- 
composition [757], extensions based on SINDy [365], and the use of delay coor- 
dinates [126]. 


Extended DMD 


The extended DMD (eDMD) algorithm is essentially the same as standard DMD
[727], except that, instead of performing regression on direct measurements of
the state, regression is performed on an augmented vector containing nonlinear
measurements of the state. As discussed earlier, eDMD is equivalent to the
variational approach of conformation dynamics [534, 535], which was
developed in 2013 by Noé and Nüske.

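Here, we will modify the notation slightly to conform to related methods.
In eDMD, an augmented state is constructed:

$$\mathbf{y} = \boldsymbol{\Theta}^T(\mathbf{x}) = \begin{bmatrix} \theta_1(\mathbf{x}) \\ \theta_2(\mathbf{x}) \\ \vdots \\ \theta_p(\mathbf{x}) \end{bmatrix}. \tag{7.95}$$

Here $\boldsymbol{\Theta}$ may contain the original state $\mathbf{x}$ as well as nonlinear measurements, so
often $p \gg n$. Next, two data matrices are constructed, as in DMD:

$$\begin{aligned}
\mathbf{Y} &= \begin{bmatrix} \mathbf{y}_1 & \mathbf{y}_2 & \cdots & \mathbf{y}_m \end{bmatrix}, && (7.96\text{a})\\
\mathbf{Y}' &= \begin{bmatrix} \mathbf{y}_2 & \mathbf{y}_3 & \cdots & \mathbf{y}_{m+1} \end{bmatrix}. && (7.96\text{b})
\end{aligned}$$

Finally, a best-fit linear operator $\mathbf{A}_Y$ is constructed that maps $\mathbf{Y}$ into $\mathbf{Y}'$:

$$\mathbf{A}_Y = \underset{\mathbf{A}_Y}{\operatorname{argmin}} \|\mathbf{Y}' - \mathbf{A}_Y \mathbf{Y}\| = \mathbf{Y}'\mathbf{Y}^{\dagger}. \tag{7.97}$$

This regression may be written in terms of the data matrices $\boldsymbol{\Theta}(\mathbf{X})$ and $\boldsymbol{\Theta}(\mathbf{X}')$:

$$\mathbf{A}_Y = \underset{\mathbf{A}_Y}{\operatorname{argmin}} \|\boldsymbol{\Theta}^T(\mathbf{X}') - \mathbf{A}_Y \boldsymbol{\Theta}^T(\mathbf{X})\| = \boldsymbol{\Theta}^T(\mathbf{X}') \left(\boldsymbol{\Theta}^T(\mathbf{X})\right)^{\dagger}. \tag{7.98}$$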

Because the augmented vector $\mathbf{y}$ may be significantly larger than the state
$\mathbf{x}$, kernel methods are often employed to compute this regression [758]. In
principle, the enriched library $\boldsymbol{\Theta}$ provides a larger basis in which to approximate
the Koopman operator. It has been shown recently that, in the limit of infinite
snapshots, the extended DMD operator converges to the Koopman operator
projected onto the subspace spanned by $\boldsymbol{\Theta}$ [405]. However, if $\boldsymbol{\Theta}$ does not span
a Koopman-invariant subspace, then the projected operator may not have any
resemblance to the original Koopman operator, as all of the eigenvalues and
eigenvectors may be different. In fact, it was shown that the extended DMD
operator will have spurious eigenvalues and eigenvectors unless it is represented
in terms of a Koopman-invariant subspace [127]. Therefore, it is essential to use
validation and cross-validation techniques to ensure that eDMD models are
not overfit, as discussed below. For example, it was shown that eDMD cannot
contain the original state $\mathbf{x}$ as a measurement and represent a system that has
multiple fixed points, periodic orbits, or other attractors, because these systems
cannot be topologically conjugate to a finite-dimensional linear system [127].


Approximating Koopman Eigenfunctions from Data


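In discrete time, a Koopman eigenfunction $\varphi(\mathbf{x})$ evaluated at a number of data
points in $\mathbf{X}$ will satisfy:

$$\begin{bmatrix} \lambda\varphi(\mathbf{x}_1) \\ \lambda\varphi(\mathbf{x}_2) \\ \vdots \\ \lambda\varphi(\mathbf{x}_m) \end{bmatrix} = \begin{bmatrix} \varphi(\mathbf{x}_2) \\ \varphi(\mathbf{x}_3) \\ \vdots \\ \varphi(\mathbf{x}_{m+1}) \end{bmatrix}. \tag{7.99}$$

It is possible to approximate this eigenfunction as an expansion in terms of a
set of candidate functions,

$$\boldsymbol{\Theta}(\mathbf{x}) = \begin{bmatrix} \theta_1(\mathbf{x}) & \theta_2(\mathbf{x}) & \cdots & \theta_p(\mathbf{x}) \end{bmatrix}. \tag{7.100}$$

The Koopman eigenfunction may be approximated in this basis as

$$\varphi(\mathbf{x}) \approx \sum_{k=1}^{p} \theta_k(\mathbf{x})\,\xi_k = \boldsymbol{\Theta}(\mathbf{x})\boldsymbol{\xi}. \tag{7.101}$$

Writing (7.99) in terms of this expansion yields the matrix system:

$$\left(\lambda\boldsymbol{\Theta}(\mathbf{X}) - \boldsymbol{\Theta}(\mathbf{X}')\right)\boldsymbol{\xi} = \mathbf{0}. \tag{7.102}$$

If we seek the best least-squares fit to (7.102), this reduces to the extended DMD
[757, 758] formulation:

$$\lambda\boldsymbol{\xi} = \boldsymbol{\Theta}(\mathbf{X})^{\dagger}\boldsymbol{\Theta}(\mathbf{X}')\boldsymbol{\xi}. \tag{7.103}$$

Note that (7.103) is the transpose of (7.98), so that left eigenvectors become
right eigenvectors. Thus, the eigenvectors $\boldsymbol{\xi}$ of $\boldsymbol{\Theta}^{\dagger}\boldsymbol{\Theta}'$ yield the coefficients of the
eigenfunction $\varphi(\mathbf{x})$ represented in the basis $\boldsymbol{\Theta}(\mathbf{x})$. It is absolutely essential then
to confirm that predicted eigenfunctions actually behave linearly on
trajectories, by comparing them with the predicted dynamics $\varphi_{k+1} = \lambda\varphi_k$, because the
regression above will result in spurious eigenvalues and eigenvectors unless
the basis elements $\theta_j$ span a Koopman-invariant subspace [127].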
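
The eigenvector computation in (7.103) and the trajectory check
$\varphi_{k+1} = \lambda\varphi_k$ can be sketched as follows. The test map, its parameters,
the library, and the sampling of data are illustrative assumptions; selecting the
eigenvalue closest to the known value `b` is only possible here because the
example was constructed with a known answer.

```python
import numpy as np

a, b = 0.9, 0.5   # hypothetical parameters of the test map

def F(X):
    """Assumed map applied row-wise to an m x 2 array of states."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([a * x1, b * x2 + (a**2 - b) * x1**2])

def theta(X):
    """Library Theta(X): one row per sample, columns (x1, x2, x1^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2])

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Xp = F(X)

# Eigen-decomposition of Theta(X)^dagger Theta(X'), as in (7.103).
M = np.linalg.pinv(theta(X)) @ theta(Xp)
evals, Xi = np.linalg.eig(M)

j = int(np.argmin(np.abs(evals - b)))     # pick the eigenvalue nearest b
lam_j = float(np.real(evals[j]))
xi = np.real(Xi[:, j])

# Validation on an independent trajectory: phi_{k+1} should equal lam * phi_k.
x = np.array([[0.8, -0.3]])
phis = []
for _ in range(20):
    phis.append(float((theta(x) @ xi)[0]))
    x = F(x)
phis = np.array(phis)
lin_err = np.max(np.abs(phis[1:] - lam_j * phis[:-1]))
print(lam_j, xi / xi[1], lin_err)
```

Here the recovered eigenfunction is proportional to $x_2 - x_1^2$. For a library
that is not Koopman-invariant, `lin_err` would typically be far from zero, which
is what the validation step is meant to expose.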


Sparse Identification of Eigenfunctions 


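It is possible to leverage the SINDy regression to identify Koopman
eigenfunctions corresponding to a particular eigenvalue $\lambda$, selecting only the few
active terms in the library $\boldsymbol{\Theta}(\mathbf{x})$ to avoid overfitting. Given the data matrices, $\mathbf{X}$
and $\dot{\mathbf{X}}$ from above, it is possible to construct the library of basis functions $\boldsymbol{\Theta}(\mathbf{X})$
as well as a library of directional derivatives, representing the possible terms in
$\nabla\varphi(\mathbf{x}) \cdot \mathbf{f}(\mathbf{x})$ from (7.70):

$$\boldsymbol{\Gamma}(\mathbf{x}, \dot{\mathbf{x}}) = \begin{bmatrix} \nabla\theta_1(\mathbf{x}) \cdot \dot{\mathbf{x}} & \nabla\theta_2(\mathbf{x}) \cdot \dot{\mathbf{x}} & \cdots & \nabla\theta_p(\mathbf{x}) \cdot \dot{\mathbf{x}} \end{bmatrix}. \tag{7.104}$$

It is then possible to construct $\boldsymbol{\Gamma}$ from data:

$$\boldsymbol{\Gamma}(\mathbf{X}, \dot{\mathbf{X}}) = \begin{bmatrix}
\nabla\theta_1(\mathbf{x}_1) \cdot \dot{\mathbf{x}}_1 & \nabla\theta_2(\mathbf{x}_1) \cdot \dot{\mathbf{x}}_1 & \cdots & \nabla\theta_p(\mathbf{x}_1) \cdot \dot{\mathbf{x}}_1 \\
\nabla\theta_1(\mathbf{x}_2) \cdot \dot{\mathbf{x}}_2 & \nabla\theta_2(\mathbf{x}_2) \cdot \dot{\mathbf{x}}_2 & \cdots & \nabla\theta_p(\mathbf{x}_2) \cdot \dot{\mathbf{x}}_2 \\
\vdots & \vdots & \ddots & \vdots \\
\nabla\theta_1(\mathbf{x}_m) \cdot \dot{\mathbf{x}}_m & \nabla\theta_2(\mathbf{x}_m) \cdot \dot{\mathbf{x}}_m & \cdots & \nabla\theta_p(\mathbf{x}_m) \cdot \dot{\mathbf{x}}_m
\end{bmatrix}.$$

For a given eigenvalue $\lambda$, the Koopman PDE in (7.70) may be evaluated on data:

$$\left(\lambda\boldsymbol{\Theta}(\mathbf{X}) - \boldsymbol{\Gamma}(\mathbf{X}, \dot{\mathbf{X}})\right)\boldsymbol{\xi} = \mathbf{0}. \tag{7.105}$$

The formulation in (7.105) is implicit, so that $\boldsymbol{\xi}$ will be in the null space of
$\lambda\boldsymbol{\Theta}(\mathbf{X}) - \boldsymbol{\Gamma}(\mathbf{X}, \dot{\mathbf{X}})$. The right null space of (7.105) for a given $\lambda$ is spanned by the
right singular vectors of $\lambda\boldsymbol{\Theta}(\mathbf{X}) - \boldsymbol{\Gamma}(\mathbf{X}, \dot{\mathbf{X}}) = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*$ (i.e., columns of $\mathbf{V}$)
corresponding to zero-valued singular values. It may be possible to identify the few
active terms in an eigenfunction by finding the sparsest vector in the null space
[576], as in the implicit-SINDy algorithm described in Section 7.3. In this
formulation, the eigenvalues $\lambda$ are not known a priori, and must be learned with
the approximate eigenfunction. Koopman eigenfunctions and eigenvalues can
also be determined as the solution to the eigenvalue problem $\mathbf{A}_Y \boldsymbol{\xi}_\alpha = \lambda_\alpha \boldsymbol{\xi}_\alpha$,
where $\mathbf{A}_Y = \boldsymbol{\Theta}^{\dagger}\boldsymbol{\Gamma}$ is obtained via least-squares regression, as in the
continuous-time version of eDMD. While many eigenfunctions are spurious, those
corresponding to lightly damped eigenvalues can be well approximated.

From a practical standpoint, data in $\mathbf{X}$ does not need to be sampled from
full trajectories, but can be obtained using more sophisticated strategies such
as Latin hypercube sampling or sampling from a distribution over the phase
space. Moreover, reproducing kernel Hilbert spaces (RKHS) can be employed
to describe $\varphi(\mathbf{x})$ locally in patches of state space.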


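Example: Duffing System (Kaiser et al. [365]). We demonstrate the sparse
identification of Koopman eigenfunctions on the undamped Duffing oscillator:

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_2 \\ x_1 - x_1^3 \end{bmatrix},$$

where $x_1$ is the position and $x_2$ is the velocity of a particle in a double-well
potential with equilibria $(0, 0)$ and $(\pm 1, 0)$. This system is conservative, with
Hamiltonian $H = \tfrac{1}{2}x_2^2 - \tfrac{1}{2}x_1^2 + \tfrac{1}{4}x_1^4$. The Hamiltonian, and in general any
conserved quantity, is a Koopman eigenfunction with zero eigenvalue.

For the eigenvalue $\lambda = 0$, (7.105) becomes $-\boldsymbol{\Gamma}(\mathbf{X}, \dot{\mathbf{X}})\boldsymbol{\xi} = \mathbf{0}$, and hence a
sparse $\boldsymbol{\xi}$ is sought in the null space of $-\boldsymbol{\Gamma}(\mathbf{X}, \dot{\mathbf{X}})$. A library of candidate
functions is constructed from data, employing polynomials up to fourth order:

$$\boldsymbol{\Theta}(\mathbf{X}) = \begin{bmatrix} x_1(t) & x_2(t) & x_1^2(t) & x_1(t)x_2(t) & \cdots & x_2^4(t) \end{bmatrix}$$

and

$$\boldsymbol{\Gamma}(\mathbf{X}, \dot{\mathbf{X}}) = \begin{bmatrix} \dot{x}_1(t) & \dot{x}_2(t) & 2x_1(t)\dot{x}_1(t) & x_2(t)\dot{x}_1(t) + x_1(t)\dot{x}_2(t) & \cdots & 4x_2^3(t)\dot{x}_2(t) \end{bmatrix}.$$

A sparse vector of coefficients $\boldsymbol{\xi}$ may be identified, with the few non-zero
entries determining the active terms in the Koopman eigenfunction. The
identified Koopman eigenfunction associated with $\lambda = 0$ is

$$\varphi(\mathbf{x}) = -\tfrac{2}{3}x_1^2 + \tfrac{2}{3}x_2^2 + \tfrac{1}{3}x_1^4. \tag{7.106}$$

This eigenfunction matches the Hamiltonian perfectly up to a constant scaling.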
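
The null-space computation for $\lambda = 0$ can be sketched with a plain SVD. This is
not the sparse solver of [576] or the implementation of [365]: it assumes the
Duffing dynamics $\dot{x}_1 = x_2$, $\dot{x}_2 = x_1 - x_1^3$ implied by the
Hamiltonian above, exact derivatives, randomly sampled states, and a polynomial
library without the constant term, and it simply takes the right singular vector
with the smallest singular value and thresholds small entries.

```python
import numpy as np

def f(X):
    """Assumed Duffing dynamics, applied row-wise to an m x 2 array."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x2, x1 - x1**3])

# Exponents (i, j) of monomials x1^i x2^j with 1 <= i + j <= 4 (no constant term).
powers = [(i, d - i) for d in range(1, 5) for i in range(d, -1, -1)]

def theta(X):
    """Polynomial library Theta(X), one column per monomial."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**i * x2**j for i, j in powers])

def gamma(X, Xdot):
    """Library of directional derivatives, as in (7.104)."""
    x1, x2 = X[:, 0], X[:, 1]
    cols = []
    for i, j in powers:
        d1 = i * x1**max(i - 1, 0) * x2**j      # d/dx1 of x1^i x2^j
        d2 = j * x1**i * x2**max(j - 1, 0)      # d/dx2 of x1^i x2^j
        cols.append(d1 * Xdot[:, 0] + d2 * Xdot[:, 1])
    return np.column_stack(cols)

rng = np.random.default_rng(1)
X = rng.uniform(-1.5, 1.5, size=(500, 2))       # sampled states, not a trajectory
Xdot = f(X)

lam = 0.0
_, s, Vt = np.linalg.svd(lam * theta(X) - gamma(X, Xdot), full_matrices=False)
xi = Vt[-1].copy()                               # direction of the smallest singular value
xi *= np.sign(xi[powers.index((0, 2))])          # fix the overall sign
xi[np.abs(xi) < 1e-8] = 0.0                      # threshold numerically zero entries
terms = {p: float(c) for p, c in zip(powers, xi) if c != 0.0}
print(terms)
```

With noise-free data the surviving terms are $x_1^2$, $x_2^2$, and $x_1^4$ with
unit-norm coefficients $(-2/3, 2/3, 1/3)$, consistent with (7.106). With noisy
derivatives a sparsity-promoting solver would be needed in place of the plain
SVD.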


Data-Driven Koopman and Delay Coordinates 


Instead of advancing instantaneous linear or nonlinear measurements of the 
state of a system directly, as in DMD, it may be possible to obtain intrinsic mea- 
surement coordinates for Koopman based on time-delayed measurements of 
the system [681]. This perspective is data-driven, relying on the 
wealth of information from previous measurements to inform the future. Un- 
like a linear or weakly nonlinear system, where trajectories may get trapped at 
fixed points or on periodic orbits, chaotic dynamics are particularly well suited 
to this analysis: trajectories evolve to densely fill an attractor, so more data pro- 
vides more information. The use of delay coordinates may be especially impor- 
tant for systems with long-term memory effects, where the Koopman approach 
has recently been shown to provide a successful analysis tool [685]. 

The time-delay measurement scheme is shown schematically in Fig. 7.13,
as illustrated on the Lorenz system for a single time series of the first variable,
$x(t)$. The conditions of the Takens embedding theorem are satisfied [694], so
it is possible to obtain a diffeomorphism between a delay-embedded attractor
and the attractor in the original coordinates. We then obtain eigen-time-delay
coordinates from a time series of a single measurement $x(t)$ by taking the SVD
of the Hankel matrix $\mathbf{H}$:

$$\mathbf{H} = \begin{bmatrix}
x(t_1) & x(t_2) & \cdots & x(t_p) \\
x(t_2) & x(t_3) & \cdots & x(t_{p+1}) \\
\vdots & \vdots & \ddots & \vdots \\
x(t_q) & x(t_{q+1}) & \cdots & x(t_m)
\end{bmatrix} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*. \tag{7.107}$$


The columns of $\mathbf{U}$ and $\mathbf{V}$ from the SVD are arranged hierarchically by their
ability to model the columns and rows of $\mathbf{H}$, respectively. Often, $\mathbf{H}$ may admit
a low-rank approximation by the first $r$ columns of $\mathbf{U}$ and $\mathbf{V}$. Note that the
Hankel matrix in (7.107) is the basis of the eigensystem realization algorithm
in linear system identification (see Section 9.3) and singular spectrum analysis
(SSA) in climate time-series analysis.

Figure 7.13: Decomposition of chaos into a linear system with forcing. A time
series $x(t)$ is stacked into a Hankel matrix $\mathbf{H}$. The SVD of $\mathbf{H}$ yields a hierarchy
of eigen-time series that produce a delay-embedded attractor. A best-fit linear
regression model is obtained on the delay coordinates $\mathbf{v}$; the linear fit for the
first $r - 1$ variables is excellent, but the last coordinate $v_r$ is not well modeled
as linear. Instead, $v_r$ is an input that forces the first $r - 1$ variables. Rare
forcing events correspond to lobe switching in the chaotic dynamics. This
architecture is called the Hankel alternative view of Koopman (HAVOK) analysis, from
[126]. Modified from Brunton et al. [126].
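
The construction in (7.107) can be sketched in NumPy. To keep the example short
and checkable, a quasi-periodic signal with two assumed frequencies stands in
for the Lorenz time series; for such a signal the Hankel matrix has rank four,
and a linear model on the delay coordinates is exact. The helper name
`hankel_matrix`, the number of rows `q`, and the rank threshold are assumptions.

```python
import numpy as np

def hankel_matrix(x, q):
    """Stack a scalar time series into a q-row Hankel matrix, as in (7.107)."""
    p = len(x) - q + 1
    return np.array([x[i:i + p] for i in range(q)])

dt = 0.05
t = np.arange(0, 100, dt)
x = np.sin(t) + 0.5 * np.sin(2.3 * t)       # stand-in signal, not Lorenz data

H = hankel_matrix(x, q=40)
U, s, Vt = np.linalg.svd(H, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))           # numerical rank

# Eigen-time-delay coordinates: the first r columns of V, one row per time.
V = Vt[:r].T

# Best-fit linear model v_{k+1} = A v_k on the delay coordinates.
A_T = np.linalg.lstsq(V[:-1], V[1:], rcond=None)[0]
fit_residual = np.max(np.abs(V[:-1] @ A_T - V[1:]))
print(r, fit_residual)
```

For a chaotic signal the singular values decay without a sharp cutoff and the
closed linear fit degrades, which motivates the forced model discussed next in
the text.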

The low-rank approximation to (7.107) provides a data-driven measurement
system that is approximately invariant to the Koopman operator for states on
the attractor. By definition, the dynamics map the attractor into itself, making
it invariant to the flow. In other words, the columns of $\mathbf{U}$ form a
Koopman-invariant subspace. We may rewrite (7.107) with the Koopman operator
$\mathcal{K} \triangleq \mathcal{K}_{\Delta t}$:

$$\mathbf{H} = \begin{bmatrix}
x(t_1) & \mathcal{K}x(t_1) & \cdots & \mathcal{K}^{p-1}x(t_1) \\
\mathcal{K}x(t_1) & \mathcal{K}^2 x(t_1) & \cdots & \mathcal{K}^{p}x(t_1) \\
\vdots & \vdots & \ddots & \vdots \\
\mathcal{K}^{q-1}x(t_1) & \mathcal{K}^{q}x(t_1) & \cdots & \mathcal{K}^{m-1}x(t_1)
\end{bmatrix}. \tag{7.108}$$

The columns of (7.107) are well approximated by the first $r$ columns of $\mathbf{U}$. The
first $r$ columns of $\mathbf{V}$ provide a time series of the magnitude of each of the
columns of $\mathbf{U}\boldsymbol{\Sigma}$ in the data. By plotting the first three columns of $\mathbf{V}$, we
obtain an embedded attractor for the Lorenz system (see Fig. 7.13).

The connection between eigen-time-delay coordinates from the SVD of $\mathbf{H}$ and
the Koopman operator motivates a linear regression model on the variables in
$\mathbf{V}$. Even with an approximately Koopman-invariant measurement system, there
remain challenges to identifying a linear model for a chaotic system. A linear
model, however detailed, cannot capture multiple fixed points or the
unpredictable behavior characteristic of chaos with a positive Lyapunov exponent
[127]. Instead of constructing a closed linear model for the first $r$ variables
in $\mathbf{V}$, we build a linear model on the first $r-1$ variables and recast the last
variable, $v_r$, as a forcing term:


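$$
\frac{d}{dt}\mathbf{v}(t) = \mathbf{A}\mathbf{v}(t) + \mathbf{B}v_r(t), \qquad (7.109)
$$
where $\mathbf{v} = \begin{bmatrix} v_1 & v_2 & \cdots & v_{r-1} \end{bmatrix}^T$ is a vector of the first $r-1$ eigen-time-delay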
coordinates. Other work has investigated the splitting of dynamics into deter- 
ministic linear and chaotic stochastic dynamics [497]. 

In all of the examples explored in [126], the linear model on the first $r-1$
terms is accurate, while no linear model represents $v_r$. Instead, $v_r$ is an
input forcing to the linear dynamics in (7.109), which approximates the
nonlinear dynamics. The statistics of $v_r(t)$ are non-Gaussian, with long tails
corresponding to rare-event forcing that drives lobe switching in the Lorenz
system; this is related to rare-event forcing distributions observed and modeled
by others [618]. The forced linear system in (7.109) was discovered after
applying the SINDy algorithm to delay coordinates of the Lorenz system.
Continuing to develop Koopman analysis on delay coordinates has significant
promise in the context of closed-loop feedback control, where it may be possible
to manipulate the behavior of a chaotic system by treating $v_r$ as a disturbance.

In addition, the use of delay coordinates as intrinsic measurements for Koop- 
man analysis suggests that Koopman theory may also be used to improve spa- 
tially distributed sensor technologies. A spatial array of sensors, for example 
the O(100) strain sensors on the wings of flying insects, may use phase delay 
coordinates to provide nearly optimal embeddings to detect and control con- 
vective structures (e.g., stall from a gust, leading-edge vortex formation and 
convection, etc.). 


History of Delay Embeddings for Dynamics 


The Hankel matrix has been used for decades in system identification, for ex- 
ample in the eigensystem realization algorithm (ERA) and the singular 
spectrum analysis (SSA) [120]. These early algorithms were developed specifi-
cally for linear systems, and although they were often applied to weakly non-
linear systems, it was unclear how to interpret the resulting models and de-
compositions. Modern Koopman operator theory has provided a valuable new
perspective for how to interpret the results of these classical Hankel-based ap- 
proaches when applied to nonlinear systems. Computing DMD on a Hankel 
matrix was first introduced by Tu et al. and was used by B. Brunton 
et al. in the field of neuroscience. The connection between the Hankel 
matrix and the Koopman operator, along with the linear regression models in 
(7.109), was established by Brunton et al. in the Hankel alternative view 
of Koopman (HAVOK) framework. Several subsequent works have provided 
additional theoretical foundations for this approach [371].
Hirsh et al. established connections between HAVOK and the Frenet- 
Serret frame from differential geometry, motivating a more accurate compu- 
tational modeling approach. The HAVOK approach is also often referred to 
as delay-DMD or Hankel-DMD [25]. A connection between delay embed-
dings and the Koopman operator was established as early as 2004 by Mezić and
Banaszuk [500], where a stochastic Koopman operator is defined and a statisti- 
cal Takens theorem is proven. Other work has investigated the splitting of dy- 
namics into deterministic linear and chaotic stochastic dynamics [497]. The use 
of delay coordinates may be especially important for systems with long-term 
memory effects and where the Koopman approach has recently been shown to 
provide a successful analysis tool [685]. 


HAVOK Code for Lorenz System 


Code 7.6 below generates a HAVOK model for the same Lorenz system data
generated in Code 7.2. Here we use $\Delta t = 0.01$, $m_o = 10$, and $r = 10$, although
the results would be more accurate for $\Delta t = 0.001$, $m_o = 100$, and $r = 15$.


Code 7.6: [MATLAB] HAVOK code for Lorenz data generated in Section 7.1.


%% EIGEN-TIME DELAY COORDINATES
stackmax = 10;   % Number of shift-stacked rows
r = 10;          % Rank of HAVOK model
H = zeros(stackmax,size(x,1)-stackmax);
for k=1:stackmax
    H(k,:) = x(k:end-stackmax-1+k,1);
end
[U,S,V] = svd(H,'econ');   % Eigen delay coordinates

%% COMPUTE DERIVATIVES (4TH ORDER CENTRAL DIFFERENCE)
dV = zeros(length(V)-5,r);
for i=3:length(V)-3
    for k=1:r
        dV(i-2,k) = (1/(12*dt))*(-V(i+2,k)+8*V(i+1,k)-8*V(i-1,k)+V(i-2,k));
    end
end
% trim the first two and last three rows so that V matches dV
V = V(3:end-3,1:r);

%% BUILD HAVOK REGRESSION MODEL ON TIME DELAY COORDINATES
Xi = V\dV;
A = Xi(1:r-1,1:r-1)';
B = Xi(end,1:r-1)';


Code 7.6: [Python] HAVOK code for Lorenz data generated in Section 7.1.


## Eigen-time delay coordinates
## (assumes numpy is imported as np, and that x and dt come from Code 7.2)
stackmax = 10  # Number of shift-stacked rows
r = 10         # rank of HAVOK model
H = np.zeros((stackmax,x.shape[0]-stackmax))

for k in range(stackmax):
    H[k,:] = x[k:-(stackmax-k),0]

U,S,VT = np.linalg.svd(H,full_matrices=0)
V = VT.T

## Compute Derivatives (4th Order Central Difference)
dV = (1/(12*dt))*(-V[4:,:]+8*V[3:-1,:]-8*V[1:-3,:]+V[:-4,:])

# trim first and last two rows that are lost in the derivative;
# keep only the first r columns so the last row of Xi multiplies v_r
V = V[2:-2,:r]

## Build HAVOK Regression Model on Time Delay Coordinates
Xi = np.linalg.lstsq(V,dV,rcond=None)[0]
A = Xi[:(r-1),:(r-1)].T
B = Xi[-1,:(r-1)].T
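The slicing of the regression matrix into $\mathbf{A}$ and $\mathbf{B}$ is easy to get transposed. As a hedged sanity check, the sketch below replaces the Lorenz delay coordinates with random stand-in data whose derivatives are generated from a known forced linear model; `A_true`, `B_true`, and the sizes `r` and `n` are hypothetical values introduced only for this test. If the slicing convention matches (7.109), least squares should recover the known matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4      # model rank (hypothetical toy value)
n = 200    # number of time samples (hypothetical)

A_true = rng.standard_normal((r - 1, r - 1))
B_true = rng.standard_normal((r - 1, 1))

# Random stand-in for the delay coordinates; columns play the role of v_1..v_r.
V = rng.standard_normal((n, r))
v, v_r = V[:, :r - 1], V[:, r - 1:]

# Each row satisfies dv/dt = A v + B v_r; no model is imposed on v_r itself,
# so the last column of dV is left at zero.
dV = np.zeros((n, r))
dV[:, :r - 1] = v @ A_true.T + v_r @ B_true.T

# Same regression and slicing pattern as the HAVOK listing above.
Xi = np.linalg.lstsq(V, dV, rcond=None)[0]
A = Xi[:r - 1, :r - 1].T
B = Xi[-1, :r - 1].T

print(np.allclose(A, A_true), np.allclose(B, B_true.ravel()))
```

The transpose appears because the regression is written row-wise, $\dot{\mathbf{V}} \approx \mathbf{V}\boldsymbol{\Xi}$, whereas (7.109) acts on column vectors.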


Neural Networks for Koopman Embeddings 


Despite the promise of Koopman embeddings, obtaining tractable representations
has remained a central challenge. Recall that, even for relatively simple
dynamical systems, the eigenfunctions of the Koopman operator may be arbitrarily
complex. Deep learning, which is well suited for representing arbitrary
functions, has recently emerged as a promising approach for discovering and
representing Koopman eigenfunctions [440, 465, 485, 540, 692, 747, 766],
providing a data-driven embedding of strongly nonlinear systems into intrinsic
linear coordinates. In particular, the Koopman perspective fits naturally with
the deep autoencoder structure discussed in Chapter 6, where a few key latent
variables $y = \varphi(x)$ are discovered to parameterize the dynamics. In a Koopman
network, an additional constraint is enforced so that the dynamics must be linear
Figure 7.14: Deep neural network architecture used to identify Koopman
eigenfunctions $\varphi(x)$. The network is based on a deep autoencoder (top), with
encoder $y = \varphi(x)$ and decoder $x = \varphi^{-1}(y)$, which identifies intrinsic
coordinates $y = \varphi(x)$. Additional loss functions are included to enforce
linear dynamics in the autoencoder variables (bottom). Reproduced with
permission from Lusch et al. [465].


on these latent variables, forcing the functions $\varphi(x)$ to be Koopman
eigenfunctions, as illustrated in Fig. 7.14. The constraint of linear dynamics is
enforced by the loss function $\|\varphi(x_{k+1}) - K\varphi(x_k)\|$, where $K$ is a matrix. In
general, linearity is enforced over multiple time-steps, so that a trajectory is
captured by iterating $K$ on the latent variables. In addition, it is important to
be able to map back to physical variables $x$, which is why the autoencoder
structure is favorable [465]. Variational autoencoders are also used for
stochastic dynamical systems, such as molecular dynamics, where the map back to
physical configuration space from the latent variables is probabilistic
[485, 747].
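The multi-step linearity loss can be illustrated without training a network. The sketch below is an assumption-laden toy, not the architecture of [465]: it uses a hypothetical two-dimensional map (`step`) for which a hand-picked embedding `phi` with three components is known to evolve linearly, and it evaluates the loss for the correct matrix `K` and for a deliberately wrong one. In a Koopman network, `phi` and `K` would instead be learned by minimizing this kind of loss together with a reconstruction loss.

```python
import numpy as np

a, b = 0.9, 0.5   # hypothetical map parameters

def step(x):
    """One step of a toy nonlinear map with a known finite linear embedding."""
    x1, x2 = x
    return np.array([a * x1, b * x2 + (a**2 - b) * x1**2])

def phi(x):
    """Hand-picked embedding y = (x1, x2, x1^2); a network would learn this."""
    return np.array([x[0], x[1], x[0]**2])

# Linear dynamics on the embedded variables: y_{k+1} = K y_k.
K = np.array([[a, 0.0, 0.0],
              [0.0, b, a**2 - b],
              [0.0, 0.0, a**2]])

def linearity_loss(K_mat, x0, n_steps):
    """Average of ||phi(x_p) - K^p phi(x_0)||^2 over p = 1, ..., n_steps."""
    x = np.array(x0, dtype=float)
    y_pred = phi(x)
    total = 0.0
    for _ in range(n_steps):
        x = step(x)               # true nonlinear evolution
        y_pred = K_mat @ y_pred   # linear evolution in latent space
        total += np.sum((phi(x) - y_pred) ** 2)
    return total / n_steps

K_bad = K.copy()
K_bad[1, 2] = 0.0   # drop the coupling to x1^2

loss_good = linearity_loss(K, [1.0, -0.5], n_steps=20)
loss_bad = linearity_loss(K_bad, [1.0, -0.5], n_steps=20)
print(loss_good, loss_bad)
```

The correct pair gives a loss at the level of round-off, while removing the coupling term leaves a clearly nonzero loss, which is the signal a training procedure would use.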

For simple systems with a discrete eigenvalue spectrum, a compact repre- 
sentation may be obtained in terms of a few autoencoder variables. However, 
dynamical systems with continuous eigenvalue spectra defy low-dimensional 
representations using many existing neural network or Koopman representa- 
tions. Continuous spectrum dynamics are ubiquitous, ranging from the sim- 
ple pendulum to nonlinear optics and broadband turbulence. For example, the 
classical pendulum, given by 


$$
\ddot{x} = -\sin(\omega x), \qquad (7.110)
$$


exhibits a continuous range of frequencies, from $\omega$ to 0, as the amplitude of the
pendulum oscillation is increased. Thus, the continuous spectrum confounds 
a simple description in terms of a few Koopman eigenfunctions [499]. Indeed, 
away from the linear regime, an infinite Fourier sum is required to approximate 
the shift in frequency. 
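The continuous frequency shift can be made concrete. For the normalized pendulum $\ddot{x} = -\sin(x)$ (an assumption of this sketch, so that the small-amplitude frequency is 1), the classical elliptic-integral period formula can be written with the arithmetic-geometric mean (AGM): a swing of amplitude $\theta_0$ has frequency $\mathrm{AGM}(1, \cos(\theta_0/2))$. The function names below are hypothetical.

```python
import math

def agm(a, b, tol=1e-14):
    """Arithmetic-geometric mean of two positive numbers."""
    while abs(a - b) > tol * a:
        a, b = 0.5 * (a + b), math.sqrt(a * b)
    return a

def pendulum_frequency(amplitude):
    """Frequency of x'' = -sin(x) for a swing of the given amplitude
    (radians, 0 < amplitude < pi), in units where the linear frequency is 1."""
    return agm(1.0, math.cos(amplitude / 2.0))

amplitudes = [0.01, 0.5, 1.0, 2.0, 3.0, 3.14]
frequencies = [pendulum_frequency(th) for th in amplitudes]
for th, w in zip(amplitudes, frequencies):
    print(f"amplitude {th:5.2f} rad -> frequency {w:.4f}")
```

The frequency decreases smoothly and without gaps from the linear value toward zero as the amplitude approaches $\pi$, so no finite set of fixed eigenvalues covers all trajectories.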
Figure 7.15: Modified network architecture with auxiliary network to
parameterize the continuous eigenvalue spectrum. A continuous eigenvalue $\lambda$
enables aggressive dimensionality reduction in the autoencoder, avoiding the
need for higher harmonics of the fundamental frequency that are generated by
the nonlinearity. Reproduced with permission from Lusch et al. [465].


In a recent work by Lusch et al. [465], an auxiliary network is used to pa- 
rameterize the continuously varying eigenvalue, enabling a network structure 
that is both parsimonious and interpretable. This parameterized network is
depicted schematically in Fig. 7.15 and illustrated on the simple pendulum in
Fig. 7.16. In contrast to other network structures, which require a large
autoencoder layer to encode the continuous frequency shift with an asymptotic
expansion in terms of harmonics of the natural frequency, the parameterized
network is able to identify a single complex conjugate pair of eigenfunctions with
a varying imaginary eigenvalue pair. If this explicit frequency dependence is 
unaccounted for, then a high-dimensional network is necessary to account for 
the shifting frequency and eigenvalues. 

It is expected that neural network representations of dynamical systems, 
and Koopman embeddings in particular, will remain a growing area of interest 
in data-driven dynamics. Combining the representational power of deep learn- 
ing with the elegance and simplicity of Koopman embeddings has the potential 
to transform the analysis and control of complex systems. 
Figure 7.16: Neural network embedding of the nonlinear pendulum, using the
parameterized network in Fig. 7.15. As the pendulum amplitude increases, the
frequency continuously changes (I). In the Koopman eigenfunction coordinates
(III), the dynamics become linear, given by perfect circles (II.C). Reproduced
with permission from Lusch et al. [465].
Homework


Exercise 7-1. Create a similar bifurcation diagram to the logistic map, but for 
the hat map, given by 


$$
x_{k+1} = \begin{cases}
\beta x_k & \text{for } x_k \in [0, \tfrac{1}{2}), \\
\beta - \beta x_k & \text{for } x_k \in [\tfrac{1}{2}, 1].
\end{cases}
$$


For what values of $\beta$ is this map chaotic?


Exercise 7-2. This exercise will explore how to compare trajectories of chaotic 
systems. For chaotic systems, simply comparing the solutions becomes diffi- 
cult, because even minuscule changes in the initial conditions can give rise 
to entirely different solutions because of exponential growth of these small 
changes in time. Instead, it is often more natural to compare the probability 
distributions of the chaotic attractor. 


(a) Generate two trajectories with nearby initial conditions, within an initial 
distance of $1 \times 10^{-6}$, for the Lorenz system with the standard parameters.
Plot these two time series. Compute the error between the two trajectories 
as a function of time and explain the trend. 


(b) For a given point on the attractor, find the perturbation direction that 
gives the largest error for T = 1 between the two trajectories. 


(c) Now, we will compare the distribution of these trajectories using the
Kullback-Leibler (KL) divergence, also known as the relative entropy. In this
simple example, we will only compare the $x$ coordinate of the two trajectories.
The KL divergence between two discrete distributions $P(x)$ and $Q(x)$ is
given by

$$
D_{\mathrm{KL}}(P, Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right), \qquad (7.111)
$$

where $\mathcal{X}$ is the set of discrete states. For our Lorenz example, we will
compute a binned histogram of the x variable from —20 to 20 with bins 
of width 0.2, and this will be our discrete distribution. Compute the KL 
divergence of the two trajectories as a function of time. How does this 
trend differ from the error plot above? 


(d) Now generate two trajectories starting from the same initial condition but 
with different parameters of the Lorenz system. Compute the KL diver- 
gence of these two solutions. 
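As a starting point for the histogram-based comparison in this exercise, the following minimal sketch estimates the KL divergence in (7.111) from two sets of samples. It deliberately uses synthetic Gaussian samples rather than Lorenz trajectories, and the function name `kl_divergence` and the small regularizer `eps` (added so that empty bins do not produce a division by zero or a logarithm of zero) are assumptions of the sketch, not part of the exercise statement.

```python
import numpy as np

def kl_divergence(samples_p, samples_q, edges, eps=1e-12):
    """Histogram-based estimate of the KL divergence between two sample sets.

    eps is a small regularizer (an assumption of this sketch) that keeps
    empty bins from producing log(0) or division by zero.
    """
    p, _ = np.histogram(samples_p, bins=edges)
    q, _ = np.histogram(samples_q, bins=edges)
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))

edges = np.linspace(-20, 20, 201)   # 200 bins of width 0.2 on [-20, 20]
rng = np.random.default_rng(1)

ref = rng.normal(0, 5, 20000)
same = kl_divergence(ref, rng.normal(0, 5, 20000), edges)
shifted = kl_divergence(ref, rng.normal(3, 5, 20000), edges)
identical = kl_divergence(ref, ref, edges)
print(identical, same, shifted)
```

Two independent sample sets from the same distribution give a small but nonzero value because of finite sampling, which is worth keeping in mind when interpreting the divergence as a function of trajectory length.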
Exercise 7-3. This exercise will explore the finite-time Lyapunov exponent (FTLE)
as a measure of local stretching between neighboring particles in a chaotic dy- 
namical system. 


(a) First, we will generate a set of approximately equally spaced points on 
the Lorenz attractor. Generate a long-time trajectory (at least T = 100) for 
the Lorenz system, starting from a standard initial condition and using 
standard parameters. Discard the first t = 1 portion of the trajectory. Cre- 
ate a library of points, initialized with the first point on the trajectory after 
t = 1. For each subsequent point on the trajectory, add it to the library if 
and only if it is a distance of greater than $\delta = 0.1$ from all other points in
the library. This will be a sampling of the attractor. 


(b) Now, for each point $x_0$ in this set, find the nearest point $x'$ on the
attractor and compute the difference $x_\delta = x_0 - x'$. Simulate $x_0$ and
$x_\epsilon = x_0 + \epsilon x_\delta$ for $T = 1$, using $\epsilon = 0.01$. Repeat this for every point
on the attractor. For each point, compute the finite-time Lyapunov exponent
using the following formula:

$$
\sigma = \frac{1}{T} \log\left(\frac{\|x_0(T) - x_\epsilon(T)\|}{\|x_0(0) - x_\epsilon(0)\|}\right).
$$

Plot each point on the attractor, color-coded by the FTLE $\sigma$. Which points
are the most sensitive? Is this consistent with your intuition?


(c) Repeat this experiment for various $\epsilon$ and $T$ and explain the results.


(d) Finally, it is possible to create a proxy for this sensitivity using an adaptive 
step integrator, such as rk45. Simulate a long-time trajectory with a very
small minimum At and a small error tolerance. Plot the trajectory color- 
coded by the At selected by the integrator. Is this consistent with the FTLE 
plots above? 


Exercise 7-4. This exercise will test the data requirements for identifying an 
accurate SINDy model for the Lorenz system, following the work of Champion 
et al. [167]. First, we will use clean data without any measurement noise. 


(a) Generate a long trajectory with a fine sampling rate of $\Delta t = 0.0001$;
discard the first $t = 1$ of the data. Use SINDy to generate models using an
increasing length of data $T$. At what $T$ is it possible to identify the
correct Lorenz model? Plot the data up until $T$. Do the results surprise you?
(Explain why.) 


(b) Now, repeat this experiment for different sampling rates from $\Delta t = 0.0001$
to $0.1$. How does the minimum data length $T$ change? How does the number of
samples $T/\Delta t$ change?
(c) Repeat the experiment with small additive Gaussian noise on the trajectory.
With noise, repeat the identification for several noise realizations to
obtain an average success rate. Also try different noise magnitudes.
Repeat the main figure in the Champion et al. paper.


Exercise 7-5. Generate and plot a trajectory for the Rössler system, given by

$$
\begin{aligned}
\dot{x} &= -y - z, \\
\dot{y} &= x + ay, \\
\dot{z} &= b + z(x - c),
\end{aligned}
$$

with parameters a = 0.2, b = 0.2, and c = 14. First, let us marvel at how a 
chaotic system can be generated with a single quadratic term (the xz term in 
the third equation). Use this trajectory to identify a SINDy model. Explore dif- 
ferent sparsifying thresholds and different trajectory lengths. Now add a small 
amount of noise to the trajectory and re-identify the model, again with differ- 
ent thresholds and lengths. When there is sufficiently long trajectory data, what 
happens when the threshold is too large? When it is too small? 


Exercise 7-6. This example will explore and compare the SINDy method and 
genetic programming to learn the equations of motion for a challenging system. 


(a) Derive the equations of motion for a double pendulum and simulate for 
an initial condition near the double-inverted configuration. 


(b) Use this data to generate a model using the SINDy method as in Kaheman 
et al. [364]. 


(c) Use this data to generate a model using the genetic programming ap- 
proach as in Schmidt and Lipson [640]. 


(d) Repeat the above experiments with and without friction. Compare the 
results and discuss. 


(e) Try using both approaches to learn conserved quantities for the double
pendulum system in the absence of friction. Note that you will need to
use a very accurate integrator to avoid numerical issues due to chaos.


Exercise 7-7. This exercise will explore identifying a PDE using PDE-FIND 
based on data from the Korteweg-de Vries (KdV) system 


$$
u_t + u_{xxx} - 6uu_x = 0.
$$
This system gives rise to coherent traveling wave solutions, known as solitons,
that retain their shape as they travel at a constant wave speed, despite the
nonlinearity in the system. In general, a soliton solution will take the form

$$
u(x,t) = -\frac{c}{2}\operatorname{sech}^2\left(\frac{\sqrt{c}}{2}(x - ct)\right)
$$

for an arbitrary positive wave speed $c$. Notice that the wave speed $c$ is linked
to the amplitude and width of the wave.

(a) Generate data by initializing a single-soliton solution

$$
u(x, 0) = -\frac{1}{2}\operatorname{sech}^2\left(\frac{1}{2}x\right),
$$

with a wave speed corresponding to c = 1 above. Try to identify a PDE 
model using PDE-FIND based on this data. What is the sparsest model 
that supports the data? 


(b) Next, use data beginning at two initial conditions. The first initial condi- 
tion is the one above with c = 1, and the second initial condition is a soli- 
ton solution with c = 4. Concatenate this data and identify a PDE model 
using PDE-FIND. What is the sparsest model that supports the data? 


(c) Discuss the two models that are identified depending on the initial data, 
and explain any differences. 


Exercise 7-8. This exercise will explore the connection between DMD and hid- 
den Markov models (HMMs). A Markov model describes the probabilities of 
transitioning from one of finitely many states to another. Typically this informa- 
tion is encoded in a transition probability matrix P that defines a probabilistic 
dynamical system 


$$
\mathbf{x}_{k+1} = \mathbf{P}\mathbf{x}_k.
$$

The vector $\mathbf{x}_k \in \mathbb{R}^n$ is a vector of probabilities³ of being in one of $n$
states at time-step $k$. One way to simulate a Markov model forward in time is to
evolve the probabilities in $\mathbf{x}$ until they reach a steady state (given by the
eigenvector of $\mathbf{P}$ corresponding to unit eigenvalue). Alternatively, it is
possible to make an observation at each time-step $k$, whereby the state $\mathbf{x}_{k+1}$
is chosen based on the probability vector $\mathbf{P}\mathbf{x}_k$ so that $\mathbf{x}_{k+1}$ has a 1 in
exactly one position and 0s everywhere else. This technically corresponds to a
modified system

$$
\mathbf{x}_{k+1} = \mathcal{O}(\mathbf{P}\mathbf{x}_k),
$$

where the operator $\mathcal{O}$ is the observation operator that samples from the
probability distribution given by $\mathbf{P}\mathbf{x}_k$.

³ Note that this is the transpose of the Russian standard notation for Markov
models to be consistent with the rest of the book.

In this example, consider a Markov process determining a weather model for 
the transition between sunny, rainy, or cloudy weather, which are the three 
states in $\mathbf{x} \in \mathbb{R}^3$. The transition probability matrix is given by

$$
\mathbf{P} = \begin{bmatrix}
0 & 0.25 & 0.25 \\
0.40 & 0.50 & 0.25 \\
0.60 & 0.25 & 0.50
\end{bmatrix}.
$$


First, what is the long-time expected probability distribution? 


Now, simulate a random instance of this process, using the observation oper- 
ator O at every step. Create a data matrix using this process, and identify a 
DMD model. What is the structure of the model? Does it agree with the tran- 
sition matrix P? Does it satisfy conservation of probability, meaning that the 
columns each sum to 1? 


Exercise 7-9. This exercise will develop a probabilistic model for the chaotic 
dynamics in the Lorenz system, following the seminal paper by Kaiser et al. 
[367] on cluster reduced-order modeling (CROM). 


(a) First, generate a long trajectory of the Lorenz system, starting with the 
standard parameters and integrating until T = 500 with a time-step of 
At = 0.005. Next, use k-means clustering on the data to segment it into 
k = 10 clusters and plot the data, color-coded by which cluster it belongs 
to. 


(b) The CROM approach creates a k x k Markov model P for the probability 
of transitioning from one cluster to another. To compute this transition 
matrix, begin by creating a k x k matrix initialized with all zeros. Next, 
go through the trajectory data, and for every point keep track of which 
cluster the point belongs to and what cluster the next point belongs to. If 
the current point belongs to cluster j and the next point belongs to cluster 
$i$, add a 1 to the $P_{ij}$ location in the matrix. After all transitions from the
entire trajectory have been recorded, normalize each column of P by the 
sum of the column, so that all columns add up to 1. 


(c) Now simulate the evolution of the CROM model starting with the final 
data point at T = 500 and plot the evolution. How does this compare with 
a trajectory of the Lorenz system initialized at this same location? 


(d) Reproduce Figs. 4 and 5 from the Kaiser et al. [367] paper and explain 
these results. 
(e) Now, repeat the above with different cluster sizes k. Do the results im-
prove or worsen for fewer clusters? For more clusters?
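The counting procedure described in part (b) of this exercise can be sketched on a short, made-up sequence of cluster labels before applying it to clustered Lorenz data. The function name `transition_matrix` and the toy label sequence are assumptions for illustration; the handling of clusters that are never left (columns with no recorded transitions) is also a choice made here, not something the exercise specifies.

```python
import numpy as np

def transition_matrix(labels, k):
    """Column-stochastic matrix with P[i, j] = fraction of visits to
    cluster j that are followed by cluster i."""
    P = np.zeros((k, k))
    for j, i in zip(labels[:-1], labels[1:]):
        P[i, j] += 1.0                  # transition from cluster j to cluster i
    col_sums = P.sum(axis=0)
    col_sums[col_sums == 0] = 1.0       # leave unvisited clusters as zero columns
    return P / col_sums

labels = [0, 1, 0, 1, 2, 0]             # toy cluster sequence, not Lorenz data
P = transition_matrix(labels, k=3)
print(P)
print(P.sum(axis=0))                    # each visited column sums to 1
```

With this column convention, a probability vector evolves as $\mathbf{x}_{k+1} = \mathbf{P}\mathbf{x}_k$, matching the notation of Exercise 7-8.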
Chapter 8


Linear Control Theory 


The focus of this book has largely been on characterizing complex systems 
through dimensionality reduction, sparse sampling, and dynamical systems 
modeling. However, an overarching goal for many systems is the ability to ac- 
tively manipulate their behavior for a given engineering objective. The study 
and practice of manipulating dynamical systems is broadly known as control 
theory, and it is one of the most successful fields at the interface of applied 
mathematics and practical engineering. Control theory is inseparable from data 
science, as it relies on sensor measurements (data) obtained from a system to 
achieve a given objective. In fact, control theory deals with living data, as suc- 
cessful application modifies the dynamics of the system, thus changing the 
characteristics of the measurements. Control theory forces the reader to con- 
front reality, as simplifying assumptions and model approximations are tested. 

Control theory has helped shape the modern technological and industrial 
landscape. Examples abound, including cruise control in automobiles, posi- 
tion control in construction equipment, fly-by-wire autopilots in aircraft, in- 
dustrial automation, packet routing in the Internet, commercial HVAC (heat- 
ing, ventilation, and air-conditioning) systems, stabilization of rockets, and PID 
(proportional—integral—derivative) temperature and pressure control in mod- 
ern espresso machines, to name only a few of the many applications. In the 
future, control will be increasingly applied to high-dimensional, strongly non- 
linear and multi-scale problems, such as turbulence, neuroscience, finance, epi- 
demiology, autonomous robots, and self-driving cars. In these future applica- 
tions, data-driven modeling and control will be vitally important; this is the 
subject of Chapters 7 and

This chapter will introduce the key concepts from closed-loop feedback con- 
trol. The goal is to build intuition for how and when to use feedback con- 
trol, motivated by practical real-world challenges. Most of the theory will be 
developed for linear systems, where a wealth of powerful techniques exist 
[665]. This theory will then be demonstrated on simple and intuitive ex- 
amples, such as to develop a cruise controller for an automobile or to stabilize 
an inverted pendulum on a moving cart. Code will be provided in MATLAB
and Python. Historically, control was typically implemented in MATLAB be- 
cause of the extensive control toolboxes and functionality. However, advanced 
control is now possible in Python through the Python Control Systems Library 
(python-control), available at https://python-control.readthedocs.io.
This toolbox provides, among other functionality, Python wrappers for 
the same SLICOT optimization libraries that are used in MATLAB’s con- 
trol toolboxes. In all of the Python codes, it is assumed that the following is 
added to the preamble: 


from control.matlab import *
import slycot 


This will make the Python code very similar to, and in some cases identical 
to, the corresponding MATLAB code. If code is not duplicated for MATLAB 
and Python, then it may be assumed that the python-control implementation is 
nearly identical. 


Types of Control 


There are many ways to manipulate the behavior of a dynamical system, and 
these control approaches are organized schematically in Fig. 8.1. Passive control
does not require input energy, and, when sufficient, it is desirable because of its 
simplicity, reliability, and low cost. For example, stop signs at a traffic intersec- 
tion regulate the flow of traffic. Active control requires input energy, and these 
controllers are divided into two broad categories based on whether or not sen- 
sors are used to inform the controller. In the first category, open-loop control 
relies on a pre-programmed control sequence; in the traffic example, signals 
may be pre-programmed to regulate traffic dynamically at different times of 
day. In the second category, active control uses sensors to inform the control 
law. Disturbance feedforward control measures exogenous disturbances to the 
system and then feeds this into an open-loop control law; an example of feed- 
forward control would be to pre-emptively change the direction of the flow of 
traffic near a stadium when a large crowd of people are expected to leave. Fi- 
nally, the last category is closed-loop feedback control, which will be the main 
focus of this chapter. Closed-loop control uses sensors to measure the system 
directly and then shapes the control in response to whether the system is actu- 
ally achieving the desired goal. Many modern traffic systems have smart traffic 
lights with a control logic informed by inductive sensors in the roadbed that 
measure traffic density. 


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




Figure 8.1: Schematic illustrating the various types of control. Most of this chap- 
ter will focus on closed-loop feedback control. 


8.1 Closed-Loop Feedback Control 


The main focus of this chapter is closed-loop feedback control, which is the 
method of choice for systems with uncertainty, instability, and/or external
disturbances. Figure 8.2 depicts the general feedback control framework, where
sensor measurements, $\mathbf{y}$, of a system are fed back into a controller, which then
decides on an actuation signal, $\mathbf{u}$, to manipulate the dynamics and provide
robust performance despite model uncertainty and exogenous disturbances. In
all of the examples discussed in this chapter, the vector of exogenous
disturbances may be decomposed as
$\mathbf{w} = \begin{bmatrix} \mathbf{w}_d^T & \mathbf{w}_n^T & \mathbf{w}_r^T \end{bmatrix}^T$,
where $\mathbf{w}_d$ are disturbances to the state of the system, $\mathbf{w}_n$ is measurement
noise, and $\mathbf{w}_r$ is a reference trajectory that should be tracked by the
closed-loop system.

Mathematically, the system and measurements are typically described by a 
dynamical system: 


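$$
\frac{d}{dt}\mathbf{x} = \mathbf{f}(\mathbf{x}, \mathbf{u}, \mathbf{w}_d), \qquad (8.1a)
$$
$$
\mathbf{y} = \mathbf{g}(\mathbf{x}, \mathbf{u}, \mathbf{w}_n). \qquad (8.1b)
$$
The goal is to construct a control law,
$$
\mathbf{u} = \mathbf{k}(\mathbf{y}, \mathbf{w}_r), \qquad (8.2)
$$
that minimizes a cost function,
$$
J \triangleq J(\mathbf{x}, \mathbf{u}, \mathbf{w}_r). \qquad (8.3)
$$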
Figure 8.2: Standard framework for feedback control. Measurements of the
system, $\mathbf{y}(t)$, are fed back into a controller, which then decides on the
appropriate actuation signal $\mathbf{u}(t)$ to control the system. The control law is
designed to modify the system dynamics and provide good performance, quantified
by the cost $J$, despite exogenous disturbances and noise in $\mathbf{w}$. The exogenous
input $\mathbf{w}$ may also include a reference trajectory $\mathbf{w}_r$ that should be tracked.


Thus, modern control relies heavily on techniques from optimization [101]. In
general, the controller in (8.2) will be a dynamical system, rather than a static
function of the inputs. For example, the Kalman filter in Section 8.5 dynamically
estimates the full state $\mathbf{x}$ from measurements of $\mathbf{u}$ and $\mathbf{y}$. In this case, the
control law will become $\mathbf{u} = \mathbf{k}(\mathbf{y}, \hat{\mathbf{x}}, \mathbf{w}_r)$, where $\hat{\mathbf{x}}$ is the full-state estimate.

To motivate the added cost and complexity of sensor-based feedback con- 
trol, it is helpful to compare with open-loop control. For reference tracking 
problems, the controller is designed to steer the output of a system towards 
a desired reference output value $\mathbf{w}_r$, thus minimizing the error $\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{w}_r$.
Open-loop control, shown in Fig. 8.3, uses a model of the system to design
an actuation signal u that produces the desired reference output. However, this 
pre-planned strategy cannot correct for external disturbances to the system and 
is fundamentally incapable of changing the dynamics. Thus, it is impossible to 
stabilize an unstable system, such as an inverted pendulum, with open-loop 
control, since the system model would have to be known perfectly and the 
system would need to be perfectly isolated from disturbances. Moreover, any 
model uncertainty will directly contribute to open-loop tracking error. 

In contrast, closed-loop feedback control, shown in Fig. 8.4, uses sensor
measurements of the system to inform the controller about how the system 
is actually responding. These sensor measurements provide information about 
unmodeled dynamics and disturbances that would degrade the performance 
Figure 8.3: Open-loop control diagram. Given a desired reference signal $\mathbf{w}_r$,
the open-loop control law constructs a control protocol $\mathbf{u}$ to drive the system
based on a model. External disturbances ($\mathbf{w}_d$) and sensor noise ($\mathbf{w}_n$), as well
as unmodeled system dynamics and uncertainty, are not accounted for and degrade
performance.


Figure 8.4: Closed-loop feedback control diagram. The sensor signal $\mathbf{y}$ is fed
back and subtracted from the reference signal $\mathbf{w}_r$, providing information about
how the system is responding to actuation and external disturbances. The
controller uses the resulting error $\boldsymbol{\epsilon}$ to determine the correct actuation
signal $\mathbf{u}$ for the desired response. Feedback is often able to stabilize
unstable dynamics while effectively rejecting disturbances $\mathbf{w}_d$ and attenuating
noise $\mathbf{w}_n$.


in open-loop control. Further, with feedback it is often possible to modify and 
stabilize the dynamics of the closed-loop system, something that is not pos- 
sible with open-loop control. Thus, closed-loop feedback control is often able 
to maintain high-performance operation for systems with unstable dynamics, 
model uncertainty, and external disturbances. 


Examples of the Benefits of Feedback Control 


To summarize, closed-loop feedback control has several benefits over open- 
loop control: 


• It may be possible to stabilize an unstable system.

• It may be possible to compensate for external disturbances.

• It may be possible to correct for unmodeled dynamics and model uncertainty.


These issues are illustrated in the following two simple examples. 


Inverted Pendulum. Consider the unstable inverted pendulum equations, 
which will be derived later in Section 8.2. The linearized equations are 

$$ \frac{d}{dt} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ g/L & d \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u, $$

where $x_1 = \theta$, $x_2 = \dot{\theta}$, $u$ is a torque applied to the pendulum arm, $g$ is 
gravitational acceleration, $L$ is the length of the pendulum arm, and $d$ is damping. We 
may write this system in standard form as 

$$ \frac{d}{dt} x = Ax + Bu. $$

If we choose constants so that the natural frequency is $\omega_n = \sqrt{g/L} = 1$ and 
$d = 0$, then the system has eigenvalues $\lambda = \pm 1$, corresponding to an unstable 
saddle-type fixed point. 

No open-loop control strategy can change the dynamics of the system, given 
by the eigenvalues of $A$. However, with full-state feedback control, given by 
$u = -Kx$, the closed-loop system becomes 

$$ \frac{d}{dt} x = Ax + Bu = (A - BK)x. $$

Choosing $K = \begin{bmatrix} 4 & 4 \end{bmatrix}$, corresponding to a control law 
$u = -4x_1 - 4x_2 = -4\theta - 4\dot{\theta}$, the closed-loop system $(A - BK)$ has stable 
eigenvalues $\lambda = -1$ and $\lambda = -3$. 

Determining when it is possible to change the eigenvalues of the closed-loop 
system, and determining the appropriate control law $K$ to achieve this, 
will be the subject of future sections. 
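The eigenvalue claims in this example are easy to check numerically. The following is a minimal sketch, assuming NumPy is available and using the constants stated in the text ($g/L = 1$, $d = 0$, $K = [4 \; 4]$); the variable names are our own.

```python
import numpy as np

# Linearized inverted pendulum with g/L = 1 and d = 0 (values from the text).
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[4.0, 4.0]])  # full-state feedback gain, u = -K x

# Eigenvalues of the open-loop matrix A and the closed-loop matrix A - B K.
open_loop_eigs = np.sort(np.linalg.eigvals(A).real)
closed_loop_eigs = np.sort(np.linalg.eigvals(A - B @ K).real)

print("open-loop eigenvalues:  ", open_loop_eigs)
print("closed-loop eigenvalues:", closed_loop_eigs)
```

The open-loop matrix has one eigenvalue in the right half-plane, while the closed-loop matrix has characteristic polynomial $\lambda^2 + 4\lambda + 3 = (\lambda + 1)(\lambda + 3)$, so both closed-loop eigenvalues are stable.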


Cruise Control. To appreciate the ability of closed-loop control to compensate 
for unmodeled dynamics and disturbances, we will consider a simple model of 
cruise control in an automobile. Let $u$ be the rate of fuel fed into the engine, and 
let $y$ be the car's speed. Neglecting transients, a crude model¹ is 

$$ y = u. \tag{8.5} $$

Thus, if we double the gas input, we double the automobile's speed. 

¹ A more realistic model would have acceleration dynamics, so that $\dot{x} = -x + u$ and $y = x$. 

Based on this model, we may design an open-loop cruise controller to track 
a reference speed $w_r$ by simply commanding an input of $u = w_r$. However, an 
incorrect automobile model (i.e., in actuality $y = 2u$), or external disturbances, 
such as rolling hills (i.e., if $y = u + \sin(t)$), are not accounted for in the simple 
open-loop design. 

In contrast, a closed-loop control law, based on measurements of the speed, 
is able to compensate for unmodeled dynamics and disturbances. Consider the 
closed-loop control law $u = K(w_r - y)$, so that gas is increased when the 
measured velocity is too low, and decreased when it is too high. Then if the 
dynamics are actually $y = 2u$ instead of $y = u$, the open-loop system settles at 
$y = 2w_r$, a 100% steady-state tracking error, while the performance of the 
closed-loop system can be significantly improved for large $K$: 


$$ y = 2K(w_r - y) \quad \Longrightarrow \quad (1 + 2K)\,y = 2K w_r \quad \Longrightarrow \quad y = \frac{2K}{1 + 2K}\, w_r. \tag{8.6} $$


For $K = 50$, the closed-loop system only has 1% steady-state tracking error. 
Similarly, an added disturbance $w_d$ will be attenuated by a factor of $1/(2K + 1)$. 

As a concrete example, consider a reference tracking problem with a desired 
reference speed of 60 mph (miles per hour). The model is $y = u$, and the 
true system is $y = 0.5u$. In addition, there is a disturbance in the form of rolling 
hills that increase and decrease the speed by ±10 mph at a frequency of 0.5 Hz. 
An open-loop controller is compared with a closed-loop proportional controller 
with $K = 50$ in Fig. 8.5 and Code 8.1. Although the closed-loop controller has 
significantly better performance, we will see later that a large proportional gain 
may come at the cost of robustness. Adding an integral term will improve 
performance. 


Code 8.1: [MATLAB] Compare open-loop and closed-loop cruise control. 

    t = 0:.01:10;                 % time

    wr = 60*ones(size(t));        % reference speed
    d = 10*sin(pi*t);             % disturbance

    aModel = 1;                   % y = aModel*u
    aTrue = .5;                   % y = aTrue*u

    uOL = wr/aModel;              % Open-loop u based on model
    yOL = aTrue*uOL + d;          % Open-loop response

    K = 50;                       % control gain, u=K(wr-y);
    yCL = aTrue*K/(1+aTrue*K)*wr + d/(1+aTrue*K);

Figure 8.5: Open-loop versus closed-loop cruise control. 


Code 8.1: [Python] Compare open-loop and closed-loop cruise control. 

    import numpy as np

    t = np.arange(0,10,0.01)            # time

    wr = 60 * np.ones_like(t)           # reference speed
    d = 10*np.sin(np.pi*t)              # disturbance

    aModel = 1                          # y = aModel*u
    aTrue = 0.5                         # y = aTrue*u

    uOL = wr/aModel                     # Open-loop u based on model
    yOL = aTrue*uOL + d                 # Open-loop response

    K = 50                              # control gain, u=K(wr-y)
    yCL = (aTrue*K/(1+aTrue*K))*wr + d/(1+aTrue*K)


8.2 Linear Time-Invariant Systems 


The most complete theory of control has been developed for linear systems 
[80, 222, 665]. Linear systems are generally obtained by linearizing a nonlinear 
system about a fixed point or a periodic orbit. However, instability may quickly 
take a trajectory far away from the fixed point. Fortunately, an effective stabi- 
lizing controller will keep the state of the system in a small neighborhood of the 
fixed point where the linear approximation is valid. For example, in the case of 
the inverted pendulum, feedback control may keep the pendulum stabilized in 
the vertical position where the dynamics behave linearly. 

Linearization of Nonlinear Dynamics 

Given a nonlinear input-output system 

$$ \frac{d}{dt} x = f(x, u), \tag{8.7a} $$
$$ y = g(x, u), \tag{8.7b} $$

it is possible to linearize the dynamics near a fixed point $(\bar{x}, \bar{u})$ where $f(\bar{x}, \bar{u}) = 0$. 
For small $\Delta x = x - \bar{x}$ and $\Delta u = u - \bar{u}$, the dynamics $f$ may be expanded in a 
Taylor series about the point $(\bar{x}, \bar{u})$ as 


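$$ f(\bar{x} + \Delta x, \bar{u} + \Delta u) = f(\bar{x}, \bar{u}) + \underbrace{\left.\frac{df}{dx}\right|_{(\bar{x}, \bar{u})}}_{A} \cdot \Delta x + \underbrace{\left.\frac{df}{du}\right|_{(\bar{x}, \bar{u})}}_{B} \cdot \Delta u + \cdots. \tag{8.8} $$

Similarly, the output equation $g$ may be expanded as 

$$ g(\bar{x} + \Delta x, \bar{u} + \Delta u) = g(\bar{x}, \bar{u}) + \underbrace{\left.\frac{dg}{dx}\right|_{(\bar{x}, \bar{u})}}_{C} \cdot \Delta x + \underbrace{\left.\frac{dg}{du}\right|_{(\bar{x}, \bar{u})}}_{D} \cdot \Delta u + \cdots. \tag{8.9} $$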


For small displacements around the fixed point, the higher-order terms are 
negligibly small. Dropping the $\Delta$ and shifting to a coordinate system where $\bar{x}$, $\bar{u}$, 
and $\bar{y}$ are at the origin, the linearized dynamics may be written as 

$$ \frac{d}{dt} x = Ax + Bu, \tag{8.10a} $$
$$ y = Cx + Du. \tag{8.10b} $$

Note that we have neglected the disturbance and noise inputs, $w_d$ and $w_n$, 
respectively; these will be added back in the discussion on Kalman filtering in 
Section 8.5. 


Unforced Linear System 


In the absence of control (i.e., $u = 0$), and with measurements of the full state 
(i.e., $y = x$), the dynamical system in (8.10) becomes 

$$ \frac{d}{dt} x = Ax. \tag{8.11} $$

The solution $x(t)$ is given by 

$$ x(t) = e^{At} x(0), \tag{8.12} $$

where the matrix exponential is defined by 

$$ e^{At} = I + At + \frac{A^2 t^2}{2!} + \frac{A^3 t^3}{3!} + \cdots. \tag{8.13} $$


The solution in (8.12) is determined entirely by the eigenvalues and 
eigenvectors of the matrix $A$. Consider the eigendecomposition of $A$: 

$$ AT = T\Lambda. \tag{8.14} $$

In the simplest case, $\Lambda$ is a diagonal matrix of distinct eigenvalues and $T$ is a 
matrix whose columns are the corresponding linearly independent eigenvectors 
of $A$. For repeated eigenvalues, $\Lambda$ may be written in Jordan form, with 
entries above the diagonal for degenerate eigenvalues of multiplicity $\geq 2$; the 
corresponding columns of $T$ will be generalized eigenvectors. 

In either case, it is easier to compute the matrix exponential $e^{\Lambda t}$ than $e^{At}$. For 
diagonal $\Lambda$, the matrix exponential is given by 

$$ e^{\Lambda t} = \begin{bmatrix} e^{\lambda_1 t} & 0 & \cdots & 0 \\ 0 & e^{\lambda_2 t} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{\lambda_n t} \end{bmatrix}. \tag{8.15} $$

In the case of a non-trivial Jordan block in $\Lambda$ with entries above the diagonal, 
simple extensions exist related to nilpotent matrices (for details, see Perko 
[562]). 

Rearranging the terms in (8.14), we find that it is simple to represent powers 
of $A$ in terms of the eigenvectors and eigenvalues: 

$$ A = T\Lambda T^{-1}, \tag{8.16a} $$
$$ A^2 = (T\Lambda T^{-1})(T\Lambda T^{-1}) = T\Lambda^2 T^{-1}, \tag{8.16b} $$
$$ A^k = (T\Lambda T^{-1})(T\Lambda T^{-1}) \cdots (T\Lambda T^{-1}) = T\Lambda^k T^{-1}. \tag{8.16c} $$

Finally, substituting these expressions into (8.13) yields 

$$ e^{At} = e^{T\Lambda T^{-1} t} = TT^{-1} + T\Lambda T^{-1} t + \frac{T\Lambda^2 T^{-1} t^2}{2!} + \frac{T\Lambda^3 T^{-1} t^3}{3!} + \cdots \tag{8.17a} $$
$$ = T \left[ I + \Lambda t + \frac{\Lambda^2 t^2}{2!} + \frac{\Lambda^3 t^3}{3!} + \cdots \right] T^{-1} \tag{8.17b} $$
$$ = T e^{\Lambda t} T^{-1}. \tag{8.17c} $$

Thus, we see that it is possible to compute the matrix exponential efficiently 
in terms of the eigendecomposition of $A$. Moreover, the matrix of eigenvectors 
$T$ defines a change of coordinates that dramatically simplifies the dynamics: 

$$ x = Tz \quad \Longrightarrow \quad \dot{z} = T^{-1}\dot{x} = T^{-1}Ax = T^{-1}ATz \quad \Longrightarrow \quad \dot{z} = \Lambda z. \tag{8.18} $$

In other words, changing to eigenvector coordinates, the dynamics become 
diagonal. Combining (8.12) with (8.17c), it is possible to write the solution $x(t)$ 
as 

$$ x(t) = \underbrace{T \underbrace{e^{\Lambda t} \underbrace{T^{-1} x(0)}_{z(0)}}_{z(t)}}_{x(t)}. \tag{8.19} $$

In the first step, $T^{-1}$ maps the initial condition in physical coordinates, $x(0)$, 
into eigenvector coordinates, $z(0)$. The next step advances these initial conditions 
using the diagonal update $e^{\Lambda t}$, which is considerably simpler in eigenvector 
coordinates $z$. Finally, multiplying by $T$ maps $z(t)$ back to physical coordinates, 
$x(t)$. 
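The three-step solution procedure (map to eigenvector coordinates, advance with the diagonal exponential, map back) can be sketched numerically as follows. This is an illustrative check, assuming NumPy and SciPy are available; the matrix $A$, the time $t$, and the initial condition are arbitrary values chosen for the example, not taken from the text.

```python
import numpy as np
from scipy.linalg import expm

# Example system with distinct eigenvalues -1 and -2 (our own choice).
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
t = 0.7
x0 = np.array([1.0, 0.5])

# Eigendecomposition A T = T Lambda.
lam, T = np.linalg.eig(A)

z0 = np.linalg.solve(T, x0)     # step 1: z(0) = T^{-1} x(0)
zt = np.exp(lam * t) * z0       # step 2: z(t) = e^{Lambda t} z(0), diagonal update
x_eig = (T @ zt).real           # step 3: x(t) = T z(t)

# Reference solution using the matrix exponential of A directly.
x_expm = expm(A * t) @ x0

print(x_eig, x_expm)
```

Both routes give the same state, which illustrates that the dynamics decouple in eigenvector coordinates. For matrices with repeated eigenvalues or nearly parallel eigenvectors, `T` may be ill-conditioned and the direct `expm` route is the safer choice.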

In addition to making it possible to compute the matrix exponential, and 
hence the solution $x(t)$, the eigendecomposition of $A$ is even more useful to 
understand the dynamics and stability of the system. We see from (8.19) that 
the only time-varying portion of the solution is $e^{\Lambda t}$. In general, these 
eigenvalues $\lambda = a + ib$ may be complex numbers, so that the solutions are given by 
$e^{\lambda t} = e^{at}(\cos(bt) + i\sin(bt))$. Thus, if all of the eigenvalues $\lambda_k$ have negative real 
part (i.e., $\mathrm{Re}(\lambda) = a < 0$), then the system is stable, and solutions all decay to 
$x = 0$ as $t \to \infty$. However, if even a single eigenvalue has positive real part, 
then the system is unstable and will diverge from the fixed point along the 
corresponding unstable eigenvector direction. Any random initial condition is 
likely to have a component in this unstable direction, and, moreover, 
disturbances will likely excite all eigenvectors of the system. 


Forced Linear System 
With forcing, and for zero initial condition, $x(0) = 0$, the solution to (8.10a) is 

$$ x(t) = \int_0^t e^{A(t-\tau)} B u(\tau)\, d\tau \triangleq e^{At}B * u(t). \tag{8.20} $$

The control input $u(t)$ is convolved with the kernel $e^{At}B$. With an output $y = Cx$, 
we have $y(t) = Ce^{At}B * u(t)$. This convolution is illustrated in Fig. 8.6 for 
a single-input, single-output (SISO) system in terms of the impulse response 
$g(t) = Ce^{At}B = \int_0^t Ce^{A(t-\tau)}B\,\delta(\tau)\, d\tau$ given a Dirac delta input $u(t) = \delta(t)$. 

Figure 8.6: Convolution for a single-input, single-output (SISO) system. 


Discrete-Time Systems 


In many real-world applications, systems are sampled at discrete instants in 
time. Thus, digital control systems are typically formulated in terms of 
discrete-time dynamical systems: 

$$ x_{k+1} = A_d x_k + B_d u_k, \tag{8.21a} $$
$$ y_k = C_d x_k + D_d u_k, \tag{8.21b} $$

where $x_k = x(k\Delta t)$. The system matrices in (8.21) can be obtained from the 
continuous-time system in (8.10) as 

$$ A_d = e^{A\Delta t}, \tag{8.22a} $$
$$ B_d = \int_0^{\Delta t} e^{A\tau} B\, d\tau, \tag{8.22b} $$
$$ C_d = C, \tag{8.22c} $$
$$ D_d = D. \tag{8.22d} $$

Figure 8.7: The matrix exponential defines a conformal map on the complex 
plane, mapping stable eigenvalues in the left half-plane into eigenvalues inside 
the unit circle. Panels: continuous time and discrete time. 

The stability of the discrete-time system in (8.21) is still determined by the 
eigenvalues of $A_d$, although now a system is stable if and only if all 
discrete-time eigenvalues are inside the unit circle in the complex plane. Thus, $\exp(A\Delta t)$ 
defines a conformal mapping on the complex plane from continuous time to 
discrete time, where eigenvalues in the left half-plane map to eigenvalues 
inside the unit circle. 
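As an illustration of the discretization formulas, the sketch below computes $A_d$ and $B_d$ for a small example and checks that stable continuous-time eigenvalues land inside the unit circle. It assumes NumPy and SciPy are available. The example matrices and the sampling time are our own choices, and the augmented-matrix exponential used to evaluate the integral for $B_d$ is a standard numerical device that we have swapped in here; it is not described in the text.

```python
import numpy as np
from scipy.linalg import expm
from scipy.signal import cont2discrete

# Example continuous-time system (our own choice), eigenvalues -1 and -2.
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
dt = 0.1  # sampling time (assumed)

n, q = A.shape[0], B.shape[1]

# expm([[A, B], [0, 0]] * dt) = [[Ad, Bd], [0, I]], which evaluates
# Ad = e^{A dt} and Bd = int_0^dt e^{A tau} B dtau in a single call.
M = np.zeros((n + q, n + q))
M[:n, :n] = A
M[:n, n:] = B
Phi = expm(M * dt)
Ad = Phi[:n, :n]
Bd = Phi[:n, n:]

# Cross-check against SciPy's zero-order-hold discretization.
Ad_ref, Bd_ref, Cd_ref, Dd_ref, _ = cont2discrete((A, B, C, D), dt, method="zoh")

cont_eigs = np.sort(np.linalg.eigvals(A).real)
disc_eigs = np.sort(np.linalg.eigvals(Ad).real)
print("continuous-time eigenvalues:", cont_eigs)
print("discrete-time eigenvalues:  ", disc_eigs)
```

The discrete-time eigenvalues equal $e^{\lambda \Delta t}$ for each continuous-time eigenvalue $\lambda$, so eigenvalues with negative real part map to magnitudes below one.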


Example: Inverted Pendulum 


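Consider the inverted pendulum in Fig. 8.8 with a torque input $u$ at the base. 
The equation of motion, derived using the Euler-Lagrange equations,² is 

$$ \ddot{\theta} = -\frac{g}{L}\sin(\theta) + u. \tag{8.23} $$

² The Lagrangian is $\mathcal{L} = (m/2)L^2\dot{\theta}^2 - mgL\cos(\theta)$, and the Euler-Lagrange equation is 
$(d/dt)\,\partial\mathcal{L}/\partial\dot{\theta} - \partial\mathcal{L}/\partial\theta = \tau$, where $\tau$ is the input torque. 

Figure 8.8: Schematic of inverted pendulum system. 

Introducing the state $x$, given by the angular position and velocity, we can write 
this second-order differential equation as a system of first-order equations: 

$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix} \quad \Longrightarrow \quad \frac{d}{dt} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_2 \\ -\frac{g}{L}\sin(x_1) + u \end{bmatrix}. \tag{8.24} $$

Taking the Jacobian of $f(x, u)$ yields 

$$ \frac{df}{dx} = \begin{bmatrix} 0 & 1 \\ -\frac{g}{L}\cos(x_1) & 0 \end{bmatrix}, \qquad \frac{df}{du} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{8.25} $$

Linearizing at the pendulum-up ($x_1 = \pi$, $x_2 = 0$) and pendulum-down ($x_1 = 0$, 
$x_2 = 0$) equilibria gives 

$$ \underbrace{A = \begin{bmatrix} 0 & 1 \\ g/L & 0 \end{bmatrix}}_{\text{pendulum up, } \lambda = \pm\sqrt{g/L}}, \qquad \underbrace{A = \begin{bmatrix} 0 & 1 \\ -g/L & 0 \end{bmatrix}}_{\text{pendulum down, } \lambda = \pm i\sqrt{g/L}}. $$

Thus, we see that the down position is a stable center with eigenvalues $\lambda = \pm i\sqrt{g/L}$, 
corresponding to oscillations at a natural frequency of $\sqrt{g/L}$. The 
pendulum-up position is an unstable saddle with eigenvalues $\lambda = \pm\sqrt{g/L}$. 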
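The linearization of the pendulum can also be carried out numerically, which is useful when an analytic Jacobian is inconvenient. The sketch below approximates $df/dx$ with central finite differences at both equilibria. It assumes NumPy is available and sets $g/L = 1$ for concreteness; the helper names `f` and `jacobian_x` are hypothetical and introduced only for this illustration.

```python
import numpy as np

g_over_L = 1.0  # assumed value, chosen so that sqrt(g/L) = 1

def f(x, u):
    """Nonlinear pendulum dynamics: x = [theta, theta_dot]."""
    return np.array([x[1], -g_over_L * np.sin(x[0]) + u])

def jacobian_x(x_bar, u_bar, eps=1e-6):
    """Central-difference approximation of df/dx at (x_bar, u_bar)."""
    n = len(x_bar)
    J = np.zeros((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (f(x_bar + dx, u_bar) - f(x_bar - dx, u_bar)) / (2 * eps)
    return J

A_up = jacobian_x(np.array([np.pi, 0.0]), 0.0)    # pendulum up
A_down = jacobian_x(np.array([0.0, 0.0]), 0.0)    # pendulum down

eig_up = np.linalg.eigvals(A_up)
eig_down = np.linalg.eigvals(A_down)
print("up:  ", eig_up)      # real pair, one positive: unstable saddle
print("down:", eig_down)    # imaginary pair: stable center
```

The finite-difference Jacobians agree with the analytic matrices to within the step-size error, and the eigenvalues reproduce the saddle and center classification.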


8.3 Controllability and Observability 


A natural question arises in linear control theory: To what extent can 
closed-loop feedback $u = -Kx$ manipulate the behavior of the system in (8.10a)? We 
already saw in Section 8.1 that it was possible to modify the eigenvalues of 
the unstable inverted pendulum system via closed-loop feedback, resulting in 
a new system matrix $(A - BK)$ with stable eigenvalues. This section will 
provide concrete conditions on when and how the system dynamics may be 
manipulated through feedback control. The dual question, of when it is possible 
to estimate the full state $x$ from measurements $y$, will also be addressed. 

Controllability 

The ability to design the eigenvalues of the closed-loop system with the choice 
of $K$ relies on the system in (8.10a) being controllable. The controllability of a 
linear system is determined entirely by the column space of the controllability 
matrix $\mathcal{C}$: 

$$ \mathcal{C} = \begin{bmatrix} B & AB & A^2B & \cdots & A^{n-1}B \end{bmatrix}. \tag{8.26} $$

If the matrix $\mathcal{C}$ has $n$ linearly independent columns, so that it spans all of $\mathbb{R}^n$, 
then the system in (8.10a) is controllable. The span of the columns of the 
controllability matrix $\mathcal{C}$ forms a Krylov subspace that determines which state vector 
directions in $\mathbb{R}^n$ may be manipulated with control. Thus, in addition to 
controllability implying arbitrary eigenvalue placement, it also implies that any state 
$\xi \in \mathbb{R}^n$ is reachable in a finite time with some actuation signal $u(t)$. 

The following three conditions are equivalent: 

(a) Controllability. The span of $\mathcal{C}$ is $\mathbb{R}^n$. The matrix $\mathcal{C}$ may be generated by 

    >> ctrb(A,B) 

and the rank may be tested to see if it is equal to $n$ by 

    >> rank(ctrb(A,B)) 

In Python (with ctrb imported from the python-control package), the rank is computed by 

    >>> numpy.linalg.matrix_rank(ctrb(A,B)) 

(b) Arbitrary eigenvalue placement. It is possible to design the eigenvalues of 
the closed-loop system through choice of feedback $u = -Kx$: 

$$ \frac{d}{dt} x = Ax + Bu = (A - BK)x. \tag{8.27} $$

Given a set of desired eigenvalues, the gain $K$ can be determined by 

    >> K = place(A,B,neweigs) 

Designing $K$ for the best performance will be discussed in Section 8.4. 

(c) Reachability of $\mathbb{R}^n$. It is possible to steer the system to any arbitrary state 
$x(t) = \xi \in \mathbb{R}^n$ in a finite time with some actuation signal $u(t)$. 

Note that reachability also applies to open-loop systems. In particular, if a 
direction $\xi$ is not in the span of $\mathcal{C}$, then it is impossible for control to push in this 
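direction in either open-loop or closed-loop systems. 

Examples. The notion of controllability is more easily understood by 
investigating a few simple examples. First, consider the following system: 

$$ \frac{d}{dt} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u \quad \Longrightarrow \quad \mathcal{C} = \begin{bmatrix} 0 & 0 \\ 1 & 2 \end{bmatrix}. \tag{8.28} $$

This system is not controllable, because the controllability matrix $\mathcal{C}$ consists of 
two linearly dependent vectors and does not span $\mathbb{R}^2$. Even before checking 
the rank of the controllability matrix, it is easy to see that the system will not 
be controllable since the states $x_1$ and $x_2$ are completely decoupled and the 
actuation input $u$ only affects the second state. 

Modifying this example to include two actuation inputs makes the system 
controllable by increasing the control authority: 

$$ \frac{d}{dt} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \quad \Longrightarrow \quad \mathcal{C} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 2 \end{bmatrix}. \tag{8.29} $$

This fully actuated system is clearly controllable because $x_1$ and $x_2$ may be 
independently controlled with $u_1$ and $u_2$. The controllability of this system is 
confirmed by checking that the columns of $\mathcal{C}$ do span $\mathbb{R}^2$. 

The most interesting cases are less obvious than these two examples. 
Consider the system 

$$ \frac{d}{dt} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u \quad \Longrightarrow \quad \mathcal{C} = \begin{bmatrix} 0 & 1 \\ 1 & 2 \end{bmatrix}. \tag{8.30} $$

This two-state system is controllable with a single actuation input because the 
states $x_1$ and $x_2$ are now coupled through the dynamics. Similarly, 

$$ \frac{d}{dt} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \end{bmatrix} u \quad \Longrightarrow \quad \mathcal{C} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \tag{8.31} $$

is controllable even though the dynamics of $x_1$ and $x_2$ are decoupled, because 
the actuator $B = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$ is able to simultaneously affect both states and they 
have different timescales. 

We will see later in this section that controllability is intimately related to the 
alignment of the columns of $B$ with the eigenvector directions of $A$. 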
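The rank test and eigenvalue placement discussed above can be reproduced without a control toolbox. The sketch below is one possible implementation, assuming NumPy and SciPy are available: the helper `ctrb` is our own few-line stand-in for the toolbox function of the same name, and `scipy.signal.place_poles` plays the role of `place`. The four systems are the examples from this section, and the final lines recover the pendulum gain from Section 8.1.

```python
import numpy as np
from scipy.signal import place_poles

def ctrb(A, B):
    """Controllability matrix [B, AB, ..., A^(n-1) B] (our own helper)."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

A_diag = np.array([[1.0, 0.0], [0.0, 2.0]])      # decoupled states
A_coupled = np.array([[1.0, 1.0], [0.0, 2.0]])   # x2 drives x1

systems = {
    "decoupled, B = [0 1]^T": (A_diag, np.array([[0.0], [1.0]])),
    "decoupled, two inputs":  (A_diag, np.eye(2)),
    "coupled, B = [0 1]^T":   (A_coupled, np.array([[0.0], [1.0]])),
    "decoupled, B = [1 1]^T": (A_diag, np.array([[1.0], [1.0]])),
}

# The system is controllable when rank(ctrb(A, B)) equals n = 2.
ranks = {name: int(np.linalg.matrix_rank(ctrb(A, B)))
         for name, (A, B) in systems.items()}
for name, r in ranks.items():
    print(f"{name}: rank = {r}, controllable = {r == 2}")

# Eigenvalue placement for the linearized pendulum (g/L = 1, d = 0).
A_pend = np.array([[0.0, 1.0], [1.0, 0.0]])
B_pend = np.array([[0.0], [1.0]])
K = place_poles(A_pend, B_pend, [-1.0, -3.0]).gain_matrix
print("K =", K)
```

Only the first system fails the rank test. For a single-input controllable system the gain that places a given set of eigenvalues is unique, so the placement step returns the same $K$ used in the pendulum example.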


Observability 


Mathematically, observability of the system in (8.10) is nearly identical to 
controllability, although the physical interpretation differs somewhat. A system is 
observable if it is possible to estimate any state $\xi \in \mathbb{R}^n$ from a time history of the 
measurements $y(t)$. 

Again, the observability of a system is entirely determined by the row space 
of the observability matrix $\mathcal{O}$: 

$$ \mathcal{O} = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{n-1} \end{bmatrix}. \tag{8.32} $$

In particular, if the rows of the matrix $\mathcal{O}$ span $\mathbb{R}^n$, then it is possible to estimate 
any full-dimensional state $x \in \mathbb{R}^n$ from the time history of $y(t)$. The matrix $\mathcal{O}$ 
may be generated by 

    >> obsv(A,C) 


The motivation for full-state estimation is relatively straightforward. We 
have already seen that, with full-state feedback, u = —Kx, it is possible to 
modify the behavior of a controllable system. However, if full-state measure- 
ments of x are not available, it is necessary to estimate x from the measurements. 
This is possible when the system is observable. In Section 8.5, we will see that 
it is possible to design an observer dynamical system to estimate the full state 
from noisy measurements. As in the case of a controllable system, if a system is 
observable, it is possible to design the eigenvalues of the estimator dynamical 
system to have desirable characteristics, such as fast estimation and effective 
noise attenuation. 

Interestingly, the observability criterion is mathematically the dual of the 
controllability criterion. In fact, the observability matrix is the transpose of the 
controllability matrix for the pair $(A^T, C^T)$: 

    >> O = ctrb(A',C')';   % 'obsv' is dual of 'ctrb' 
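The duality between observability and controllability can be confirmed numerically. The following sketch assumes NumPy is available; `ctrb` and `obsv` are our own minimal helpers (not the toolbox functions), and the matrices $A$ and $C$ are arbitrary example values.

```python
import numpy as np

def ctrb(A, B):
    """Controllability matrix [B, AB, ..., A^(n-1) B] (our own helper)."""
    blocks = [B]
    for _ in range(A.shape[0] - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def obsv(A, C):
    """Observability matrix [C; CA; ...; C A^(n-1)] (our own helper)."""
    rows = [C]
    for _ in range(A.shape[0] - 1):
        rows.append(rows[-1] @ A)
    return np.vstack(rows)

A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
C = np.array([[1.0, 0.0]])   # measure only the first state

O_direct = obsv(A, C)
O_dual = ctrb(A.T, C.T).T    # transpose of the controllability matrix of (A^T, C^T)

print(O_direct)
print("observable:", np.linalg.matrix_rank(O_direct) == A.shape[0])
```

Here the single measurement of $x_1$ is enough to observe both states, because $x_2$ influences $x_1$ through the coupling in $A$.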


The PBH Test for Controllability 


There are many tests to determine whether or not a system is controllable. One 
of the most useful and illuminating is the Popov-Belevitch-Hautus (PBH) test. 
The PBH test states that the pair $(A, B)$ is controllable if and only if the column 
rank of the matrix $\begin{bmatrix} (A - \lambda I) & B \end{bmatrix}$ is equal to $n$ for all $\lambda \in \mathbb{C}$. This test is 
particularly fascinating because it connects controllability³ to a relationship between 
the columns of $B$ and the eigenspace of $A$. 

³ There is an equivalent PBH test for observability that states that $\begin{bmatrix} (A - \lambda I) \\ C \end{bmatrix}$ must have row 
rank $n$ for all $\lambda \in \mathbb{C}$ for the system to be observable. 

First, the PBH test only needs to be checked at $\lambda$ that are eigenvalues of $A$, 
since the rank of $A - \lambda I$ is equal to $n$ except when $\lambda$ is an eigenvalue of $A$. In fact, 
the characteristic equation $\det(A - \lambda I) = 0$ is used to determine the eigenvalues 
of $A$ as exactly those values where the matrix $A - \lambda I$ becomes rank-deficient or 
degenerate. 

Now, given that $(A - \lambda I)$ is only rank-deficient for eigenvalues $\lambda$, it also 
follows that the null space, or kernel, of $A - \lambda I$ is given by the span of the 
eigenvectors corresponding to that particular eigenvalue. Thus, for $\begin{bmatrix} (A - \lambda I) & B \end{bmatrix}$ to 
have rank $n$, the columns in $B$ must have some component in each of the 
eigenvector directions associated with $A$ to complement the null space of $A - \lambda I$. 

If $A$ has $n$ distinct eigenvalues, then the system will be controllable with a 
single actuation input, since the matrix $A - \lambda I$ will have at most one eigenvector 
direction in the null space. In particular, we may choose $B$ as the sum of all 
of the $n$ linearly independent eigenvectors, and it will be guaranteed to have 
some component in each direction. It is also interesting to note that if $B$ is a 
random vector (>> B=randn(n,1);), then $(A, B)$ will be controllable with high 
probability, since it will be exceedingly unlikely that $B$ will be randomly chosen 
so that it has zero contribution from any given eigenvector. 

If there are degenerate eigenvalues with multiplicity $\geq 2$, so that the null 
space of $A - \lambda I$ is multi-dimensional, then the actuation input must have as 
many degrees of freedom. In other words, the only time that multiple actuators 
(columns of B) are strictly required is for systems that have degenerate eigen- 
values. However, if a system is highly non-normal, it may be helpful to have 
multiple actuators in practice for better control authority. Such non-normal sys- 
tems are characterized by large transient growth due to destructive interference 
between nearly parallel eigenvectors, often with similar eigenvalues. 
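The PBH test translates directly into a short numerical check. The sketch below is a minimal version, assuming NumPy is available; the function name `pbh_controllable` and the rank tolerance are our own choices. It evaluates the rank of $[(A - \lambda I) \;\; B]$ only at the eigenvalues of $A$, as justified in the discussion above.

```python
import numpy as np

def pbh_controllable(A, B, tol=1e-9):
    """PBH test: rank [A - lambda*I, B] == n at every eigenvalue of A."""
    n = A.shape[0]
    for lam in np.linalg.eigvals(A):
        M = np.hstack([A - lam * np.eye(n), B])
        if np.linalg.matrix_rank(M, tol=tol) < n:
            return False
    return True

A_distinct = np.array([[1.0, 0.0], [0.0, 2.0]])   # distinct eigenvalues
A_repeated = np.eye(2)                             # eigenvalue 1 with multiplicity 2

b_second = np.array([[0.0], [1.0]])   # only actuates the second state
b_both = np.array([[1.0], [1.0]])     # component along both eigenvectors

results = {
    "distinct, B = [0 1]^T": pbh_controllable(A_distinct, b_second),
    "distinct, B = [1 1]^T": pbh_controllable(A_distinct, b_both),
    "repeated, B = [1 1]^T": pbh_controllable(A_repeated, b_both),
    "repeated, B = I":       pbh_controllable(A_repeated, np.eye(2)),
}
for name, ok in results.items():
    print(f"{name}: controllable = {ok}")
```

The third case illustrates the point about degenerate eigenvalues: with a two-dimensional null space of $A - \lambda I$, a single actuator cannot restore full rank, whereas two independent actuators can.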


The Cayley—Hamilton Theorem and Reachability 


To provide insight into the relationship between the controllability of the pair 
$(A, B)$ and the reachability of any vector $\xi \in \mathbb{R}^n$ via the actuation input $u(t)$, 
we will leverage the Cayley-Hamilton theorem. This is a gem of linear algebra 
that provides an elegant way to represent solutions of $\dot{x} = Ax$ in terms of a 
finite sum of powers of $A$, rather than the infinite sum required for the matrix 
exponential in (8.13). 

The Cayley-Hamilton theorem states that every matrix $A$ satisfies its own 
characteristic (eigenvalue) equation, $\det(A - \lambda I) = 0$: 

$$ \det(A - \lambda I) = \lambda^n + a_{n-1}\lambda^{n-1} + \cdots + a_2\lambda^2 + a_1\lambda + a_0 = 0 \tag{8.33a} $$
$$ \Longrightarrow \quad A^n + a_{n-1}A^{n-1} + \cdots + a_2A^2 + a_1A + a_0I = 0. \tag{8.33b} $$

Although this is relatively simple to state, it has profound consequences. In 
particular, it is possible to express $A^n$ as a linear combination of smaller powers 
of $A$: 

$$ A^n = -a_0I - a_1A - a_2A^2 - \cdots - a_{n-1}A^{n-1}. \tag{8.34} $$

It is straightforward to see that this also implies that any higher power $A^{k \geq n}$ 
may also be expressed as a sum of the matrices $\{I, A, \ldots, A^{n-1}\}$: 

$$ A^{k \geq n} = \sum_{j=0}^{n-1} \alpha_j A^j. \tag{8.35} $$

Thus, it is possible to express the infinite sum in the exponential $e^{At}$ as 

$$ e^{At} = I + At + \frac{A^2t^2}{2!} + \cdots \tag{8.36a} $$
$$ = \beta_0(t)I + \beta_1(t)A + \beta_2(t)A^2 + \cdots + \beta_{n-1}(t)A^{n-1}. \tag{8.36b} $$
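The Cayley-Hamilton theorem is easy to verify numerically for a specific matrix. The sketch below assumes NumPy is available and uses an arbitrary $3 \times 3$ example matrix of our own choosing. Note that `np.poly` returns the coefficients of $\det(\lambda I - A)$, which is normalized to have leading coefficient one and differs from $\det(A - \lambda I)$ by at most a sign.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 0.0, 1.0]])
n = A.shape[0]

# Coefficients [1, a_{n-1}, ..., a_1, a_0] of the monic characteristic polynomial.
coeffs = np.poly(A)

# Evaluate the characteristic polynomial at the matrix A itself.
p_of_A = sum(c * np.linalg.matrix_power(A, n - i) for i, c in enumerate(coeffs))

# Express A^n as a combination of lower powers I, A, ..., A^(n-1).
A_n_from_lower = -sum(c * np.linalg.matrix_power(A, n - i)
                      for i, c in enumerate(coeffs) if i > 0)

print("max |p(A)| =", np.abs(p_of_A).max())
print("A^n matches combination of lower powers:",
      np.allclose(A_n_from_lower, np.linalg.matrix_power(A, n)))
```

The residual is zero up to round-off, and $A^n$ is recovered from $\{I, A, \ldots, A^{n-1}\}$, which is the property used to collapse the infinite series for $e^{At}$ into a finite sum.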


We are now equipped to see how controllability relates to the reachability 
of an arbitrary vector $\xi \in \mathbb{R}^n$. From (8.20), we see that a state $\xi$ is reachable if 
there is some $u(t)$ so that 

$$ \xi = \int_0^t e^{A(t-\tau)} B u(\tau)\, d\tau. \tag{8.37} $$

Expanding the exponential on the right-hand side in terms of (8.36b), we have 

$$ \xi = \int_0^t \left[ \beta_0(t-\tau) I B u(\tau) + \beta_1(t-\tau) A B u(\tau) + \cdots + \beta_{n-1}(t-\tau) A^{n-1} B u(\tau) \right] d\tau $$
$$ = B \int_0^t \beta_0(t-\tau) u(\tau)\, d\tau + AB \int_0^t \beta_1(t-\tau) u(\tau)\, d\tau + \cdots + A^{n-1}B \int_0^t \beta_{n-1}(t-\tau) u(\tau)\, d\tau $$
$$ = \begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} \begin{bmatrix} \int_0^t \beta_0(t-\tau) u(\tau)\, d\tau \\ \int_0^t \beta_1(t-\tau) u(\tau)\, d\tau \\ \vdots \\ \int_0^t \beta_{n-1}(t-\tau) u(\tau)\, d\tau \end{bmatrix}. $$

Note that the matrix on the left is the controllability matrix $\mathcal{C}$, and we see that 
the only way that all of $\mathbb{R}^n$ is reachable is if the column space of $\mathcal{C}$ spans all of 
$\mathbb{R}^n$. It is somewhat more difficult to see that if $\mathcal{C}$ has rank $n$ then it is possible to 
design a $u(t)$ to reach any arbitrary state $\xi \in \mathbb{R}^n$, but this relies on the fact that 
the $n$ functions $\{\beta_j(t)\}_{j=0}^{n-1}$ are linearly independent functions. It is also the case 
that there is not a unique actuation input $u(t)$ to reach a given state $\xi$, as there 
are many different paths one may take. 

Gramians and Degrees of Controllability/Observability 


The previous tests for controllability and observability are binary, in the sense 
that the rank of $\mathcal{C}$ (respectively, $\mathcal{O}$) is either $n$, or it is not. However, there are 
degrees of controllability and observability, as some states $x$ may be easier to 
control or estimate than others. 

To identify which states are more or less controllable, one must analyze the 
eigendecomposition of the controllability Gramian: 

$$ W_c(t) = \int_0^t e^{A\tau} B B^* e^{A^*\tau}\, d\tau. \tag{8.38} $$

Similarly, the observability Gramian is given by 

$$ W_o(t) = \int_0^t e^{A^*\tau} C^* C e^{A\tau}\, d\tau. \tag{8.39} $$

These Gramians are often evaluated at infinite time, and, unless otherwise stated, 
we refer to $W_c = \lim_{t \to \infty} W_c(t)$ and $W_o = \lim_{t \to \infty} W_o(t)$. 

The controllability of a state $x$ is measured by $x^* W_c x$, which will be larger 
for more controllable states. If the value of $x^* W_c x$ is large, then it is possible 
to navigate the system far in the $x$ direction with a unit control input. The 
observability of a state is similarly measured by $x^* W_o x$. Both Gramians are 
symmetric and positive semi-definite, having non-negative eigenvalues. Thus, the 
eigenvalues and eigenvectors may be ordered hierarchically, with eigenvectors 
corresponding to large eigenvalues being more easily controllable or 
observable. In this way, the Gramians induce a new inner product over state space in 
terms of the controllability or observability of the states. 

Gramians may be visualized by ellipsoids in state space, with the principal 
axes given by directions that are hierarchically ordered in terms of 
controllability or observability. An example of this visualization is shown 
in Chapter 9. In fact, Gramians may be used to design reduced-order models 
for high-dimensional systems. Through a balancing transformation, a key 
subspace is identified with the most jointly controllable and observable modes. 
These modes then define a good projection basis to define a model that captures 
the dominant input-output dynamics. This form of balanced model reduction 
will be investigated further in Section 9.2. 

Gramians are also useful to determine the minimum-energy control $u(t)$ 
required to navigate the system to $x(t_f)$ at time $t_f$ from $x(0) = 0$: 

$$ u(\tau) = B^* \left( e^{A(t_f - \tau)} \right)^* W_c(t_f)^{-1} x(t_f). \tag{8.40} $$

The total energy expended by this control law is given by 

$$ \int_0^{t_f} \| u(\tau) \|^2\, d\tau = x^* W_c(t_f)^{-1} x. \tag{8.41} $$

It can now be seen that if the controllability matrix is nearly singular, then there 
are directions that require extreme actuation energy to manipulate. Conversely, 
if the eigenvalues of $W_c$ are all large, then the system is easily controlled. 

It is generally impracticable to compute the Gramians directly using (8.38) 
and (8.39). Instead, the controllability Gramian is the solution to the following 
Lyapunov equation, 

$$ A W_c + W_c A^* + B B^* = 0, \tag{8.42} $$

while the observability Gramian is the solution to 

$$ A^* W_o + W_o A + C^* C = 0. \tag{8.43} $$

Obtaining Gramians by solving a Lyapunov equation is typically quite 
expensive for high-dimensional systems [669]. Instead, Gramians 
are often approximated empirically using snapshot data from the direct 
and adjoint systems, as will be discussed in Section 9.2. 
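For small systems, the infinite-time controllability Gramian can be obtained from the Lyapunov equation with a single library call. The sketch below assumes NumPy and SciPy are available and uses a small stable example system of our own choosing (the infinite-time Gramian only exists when $A$ is stable). It also reports the minimum control energy needed to reach a unit-length state along each eigenvector of $W_c$, using the infinite-horizon limit of the energy expression $x^* W_c^{-1} x$.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Stable, controllable example system (our own choice).
A = np.array([[-1.0, 1.0],
              [0.0, -2.0]])
B = np.array([[0.0],
              [1.0]])

# solve_continuous_lyapunov(a, q) solves a X + X a^H = q, so pass q = -B B^*.
Wc = solve_continuous_lyapunov(A, -B @ B.T)
residual = A @ Wc + Wc @ A.T + B @ B.T

# Eigen-directions of Wc, ordered from least to most controllable.
evals, evecs = np.linalg.eigh(Wc)

# Energy x^* Wc^{-1} x to reach a unit vector along each eigenvector.
energy = np.array([v @ np.linalg.solve(Wc, v) for v in evecs.T])

print("Wc =\n", Wc)
print("eigenvalues of Wc:", evals)
print("energy per unit displacement:", energy)
```

Directions associated with small eigenvalues of $W_c$ require proportionally more energy, since the energy along an eigenvector is the reciprocal of the corresponding eigenvalue.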


Stabilizability and Detectability 


In practice, full-state controllability and observability may be too much to ex- 
pect in high-dimensional systems. For example, in a high-dimensional fluid 
system, it may be unrealistic to manipulate every minor fluid vortex; instead, 
control authority over the large, energy-containing coherent structures is often 
enough. 

Stabilizability refers to the ability to control all unstable eigenvector 
directions of $A$, so that they are in the span of $\mathcal{C}$. In practice, we might relax this 
definition to include lightly damped eigenvector modes, corresponding to 
eigenvalues with a small, negative real part. Similarly, if all unstable eigenvectors of 
$A$ are in the span of $\mathcal{O}^*$, then the system is detectable. 

There may also be states in the model description that are superfluous for 
control. As an example, consider the control system for a commercial passenger 
jet. The state of the system may include the passenger seat positions, although 
this will surely not be controllable by the pilot, nor should it be. 


8.4 Optimal Full-State Control: Linear-Quadratic Regulator (LQR) 


We have seen in the previous sections that if (A, B) is controllable, then it is 
possible to arbitrarily manipulate the eigenvalues of the closed-loop system 
(A — BK) through choice of a full-state feedback control law u = —Kx. This 
implicitly assumes that full-state measurements are available (i.e., C = I and 
D = 0, so that y = x). Although full-state measurements are not always available, 
especially for high-dimensional systems, we will show in the next section 
that, if the system is observable, it is possible to build a full-state estimate from 
the sensor measurements. 

Given a controllable system, and either measurements of the full state or 
of an observable system with a full-state estimate, there are many choices of 
stabilizing control laws u = —Kx. It is possible to make the eigenvalues of 
the closed-loop system (A — BK) arbitrarily stable, placing them as far as de- 
sired in the left half of the complex plane. However, overly stable eigenval- 
ues may require exceedingly expensive control expenditure and might also 
result in actuation signals that exceed maximum allowable values. Choosing 
very stable eigenvalues may also cause the control system to overreact to noise 
and disturbances, much as a new driver will overreact to vibrations in the 
steering wheel, causing the closed-loop system to jitter. Over-stabilization can 
counter-intuitively degrade robustness and may lead to instability if there are 
small time delays or unmodeled dynamics. Robustness will be discussed in 
a later section. 

Choosing the best gain matrix K to stabilize the system without expending 
too much control effort is an important goal in optimal control. A balance must 
be struck between the stability of the closed-loop system and the aggressive- 
ness of control. It is important to take control expenditure into account (1) to 
prevent the controller from overreacting to high-frequency noise and distur- 
bances, (2) so that actuation does not exceed maximum allowed amplitudes, 
and (3) so that control is not prohibitively expensive. In particular, the cost 
function 


$$ J(t) = \int_0^t x(\tau)^* Q x(\tau) + u(\tau)^* R u(\tau)\, d\tau \tag{8.44} $$


balances the cost of effective regulation of the state with the cost of control. The 
matrices Q and R weight the cost of deviations of the state from zero and the 
cost of actuation, respectively. The matrix Q is positive semi-definite, and R is 
positive definite; these matrices are often diagonal, and the diagonal elements 
may be tuned to change the relative importance of the control objectives. 

Adding such a cost function makes choosing the control law a well-posed 
optimization problem, for which there is a wealth of theoretical and numerical 
techniques [101]. The linear-quadratic regulator (LOR) control law u = —K,x 
is designed to minimize J = liM; J(t). LOR is so-named because it is a lin- 
ear control law, designed for a linear system, minimizing a quadratic cost func- 
tion, that regulates the state of the system to lim;_,.. x(t) = 0. Because the cost 
function in (8.44) is quadratic, there is an analytical solution for the optimal 
controller gains K,, given by 


K, = R!B*X, (8.45) 
where X is the solution to an algebraic Riccati equation: 


A*X + XA - XBR 'B*'X+Q=0. (8.46) 
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Figure 8.9: Schematic of the linear—quadratic regulator (LOR) for optimal full- 
state feedback. The optimal controller for a linear system given measurements 
of the full state, y = x, is given by proportional control u = —K,-x, where K, is 
a constant-gain matrix obtained by solving an algebraic Riccati equation. 


Solving the above Riccati equation for X, and hence for K,, is numerically ro- 
bust and already implemented in many programming languages [76] |430]. The 
gain matrix K, is obtained via 


|| >> he = Doria a, On 
in MATLAB and via 
|| >>> Ke =] Jeo (eo; 8) 10) 


in python-control. However, solving the Riccati equation scales as O(n’) in the 
state dimension n, making it prohibitively expensive for large systems or for 
online computations for slowly changing state equations or linear parameter 
varying (LPV) control. This motivates the development of reduced-order mod- 
els that capture the same dominant behavior with many fewer states. Control- 
oriented reduced-order models will be developed more in Chapter 
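
As a minimal sketch of the computation behind `lqr`, the gain in (8.45) can be obtained by passing the Riccati equation (8.46) to a general-purpose solver. The example below uses `scipy.linalg.solve_continuous_are`; the helper name `lqr_gain` and the double-integrator system are illustrative assumptions and do not come from the text.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """Return (K, X), where X solves A^T X + X A - X B R^{-1} B^T X + Q = 0
    and K = R^{-1} B^T X is the LQR gain (hypothetical helper)."""
    X = solve_continuous_are(A, B, Q, R)
    K = np.linalg.solve(R, B.T @ X)
    return K, X

# Illustrative system (an assumption): double integrator, force input
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)            # state cost
R = np.array([[1.0]])    # control cost

K, X = lqr_gain(A, B, Q, R)

# Residual of the Riccati equation (8.46); should vanish to round-off
residual = A.T @ X + X @ A - X @ B @ np.linalg.solve(R, B.T @ X) + Q

# Eigenvalues of the closed-loop matrix A - B K
closed_loop_eigs = np.linalg.eigvals(A - B @ K)

print("K =", K)
print("closed-loop eigenvalues:", closed_loop_eigs)
```

For this particular system the Riccati equation can be solved by hand, giving $K_r = [1 \;\; \sqrt{3}]$, which provides a convenient check on the numerical result.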

The LQR controller is shown schematically in Fig. Out of all possible
control laws $u = K(x)$, including nonlinear controllers, the LQR controller
$u = -K_r x$ is optimal, as we will show in Section 8.4. However, it may be the
case that a linearized system is linearly uncontrollable while the full nonlinear 
system in is controllable with a nonlinear control law u = K(x). 


Derivation of the Riccati Equation for Optimal Control 


It is worth taking a theoretical detour here to derive the Riccati equation in 
for the problem of optimal full-state regulation. This derivation will provide
an example of how to solve convex optimization problems using the calculus
of variations, and it will also provide a template for computing the optimal
control solution for nonlinear systems. Because of the similarity of optimal control
to the formulation of Lagrangian and Hamiltonian classical mechanics in
terms of the variational principle, we adopt similar language and notation.

First, we will add a terminal cost to our LQR cost function in (8.44), and also 
introduce a factor of 1/2 to simplify computations: 


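$$ J = \int_0^{t_f} \underbrace{\tfrac{1}{2}\left(x^* Q x + u^* R u\right)}_{\text{Lagrangian, } \mathcal{L}} \, d\tau + \underbrace{\tfrac{1}{2}\, x(t_f)^* Q_f\, x(t_f)}_{\text{terminal cost}}. \tag{8.47} $$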


The goal is to minimize the quadratic cost function J subject to the dynamical 
constraint 
$$ \dot{x} = Ax + Bu. \tag{8.48} $$


We may solve this using the calculus of variations by introducing the fol- 
lowing augmented cost function: 


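$$ J_{\text{aug}} = \int_0^{t_f} \left[ \tfrac{1}{2}\left(x^* Q x + u^* R u\right) + \lambda^* (Ax + Bu - \dot{x}) \right] d\tau + \tfrac{1}{2}\, x(t_f)^* Q_f\, x(t_f). \tag{8.49} $$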


The variable $\lambda$ is a Lagrange multiplier, called the co-state, that enforces the
dynamic constraints; $\lambda$ may take any value and $J_{\text{aug}} = J$ will hold.
Taking the total variation of $J_{\text{aug}}$ in (8.49) yields


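$$ \delta J_{\text{aug}} = \int_0^{t_f} \left[ \frac{\partial \mathcal{L}}{\partial x}\,\delta x + \frac{\partial \mathcal{L}}{\partial u}\,\delta u + \lambda^* A\,\delta x + \lambda^* B\,\delta u - \lambda^*\,\delta\dot{x} \right] d\tau + x(t_f)^* Q_f\,\delta x(t_f). \tag{8.50} $$
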
The partial derivatives¹ of the Lagrangian are $\partial\mathcal{L}/\partial x = x^* Q$ and $\partial\mathcal{L}/\partial u = u^* R$.
The last term in the integral may be modified using integration by parts: 


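$$ -\int_0^{t_f} \lambda^*\,\delta\dot{x}\, d\tau = -\lambda^*(t_f)\,\delta x(t_f) + \lambda^*(0)\,\delta x(0) + \int_0^{t_f} \dot{\lambda}^*\,\delta x\, d\tau. $$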


The term $\lambda^*(0)\,\delta x(0)$ is equal to zero, or else the control system would be
non-causal (i.e., future control could change the initial condition of the system).

Finally, the total variation of the augmented cost function in (8.50) simplifies 
as follows: 


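$$ \delta J_{\text{aug}} = \int_0^{t_f} \left(x^* Q + \lambda^* A + \dot{\lambda}^*\right)\delta x\, d\tau + \int_0^{t_f} \left(u^* R + \lambda^* B\right)\delta u\, d\tau + \left(x(t_f)^* Q_f - \lambda^*(t_f)\right)\delta x(t_f). \tag{8.51} $$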


¹ The derivative of a matrix expression $Ax$ with respect to $x$ is $A$, and the derivative of $x^* A$
with respect to $x$ is $A^*$.

Each variation term in (8.51) must equal zero for an optimal control solution
that minimizes $J$. Thus, we may break this up into three equations:


$$ x^* Q + \lambda^* A + \dot{\lambda}^* = 0, \tag{8.52a} $$
$$ u^* R + \lambda^* B = 0, \tag{8.52b} $$
$$ x(t_f)^* Q_f - \lambda^*(t_f) = 0. \tag{8.52c} $$

Note that the terminal constraint, arising from the last term in (8.51), represents an
initial condition for the reverse-time equation for $\lambda$ starting at $t_f$. Thus, the
dynamics in (8.48) with initial condition $x(0) = x_0$ and the co-state equation (8.52a)
with the final-time condition $\lambda(t_f) = Q_f x(t_f)$ form
a two-point boundary value problem. This may be integrated numerically to
find the optimal control solution, even for nonlinear systems.

Because the dynamics are linear, it is possible to posit the form $\lambda = Px$, and
substitute into the equations above. The first equation becomes:


$$ (\dot{P}x + P\dot{x})^* + x^* Q + \lambda^* A = 0. $$


Taking the transpose, and substituting (8.48) in for $\dot{x}$, yields

$$ \dot{P}x + P(Ax + Bu) + Qx + A^* P x = 0. $$


From (8.52b), we have


$$ u = -R^{-1}B^*\lambda = -R^{-1}B^*Px. $$

Finally, combining yields

$$ \dot{P}x + PAx + A^*Px - PBR^{-1}B^*Px + Qx = 0. \tag{8.53} $$


This equation must be true for all $x$, and so it may also be written as a matrix
equation. Dropping the terminal cost and letting time go to infinity, the $\dot{P}$ term
disappears, and we recover the algebraic Riccati equation:


$$ PA + A^*P - PBR^{-1}B^*P + Q = 0. $$


Although this procedure is somewhat involved, each step is relatively straightforward.
In addition, the linear dynamics may be replaced with nonlinear dynamics
$\dot{x} = f(x, u)$, and a similar nonlinear two-point boundary value problem
may be formulated with $\partial f/\partial x$ replacing $A$ and $\partial f/\partial u$ replacing $B$. This procedure
is extremely general, and may be used to numerically obtain nonlinear
optimal control trajectories.

Hamiltonian Formulation. Similar to the Lagrangian formulation above, it is
also possible to solve the optimization problem by introducing the following
Hamiltonian:


$$ \mathcal{H} = \tfrac{1}{2}\left(x^* Q x + u^* R u\right) + \lambda^*(Ax + Bu). \tag{8.54} $$


Then Hamilton’s equations become 


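$$ \dot{x} = \left(\frac{\partial \mathcal{H}}{\partial \lambda}\right)^* = Ax + Bu, \qquad x(0) = x_0, \tag{8.55a} $$
$$ -\dot{\lambda} = \left(\frac{\partial \mathcal{H}}{\partial x}\right)^* = Qx + A^*\lambda, \qquad \lambda(t_f) = Q_f x(t_f). \tag{8.55b} $$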


Again, this is a two-point boundary value problem in $x$ and $\lambda$. Plugging in the
same expression $\lambda = Px$ will result in the same Riccati equation as above.
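
The finite-horizon matrix equation behind (8.53) can also be integrated numerically. The sketch below assumes a zero terminal cost ($Q_f = 0$) and an illustrative double integrator (neither taken from the text). It marches $P$ backward from $t_f$ using the reversed time variable $s = t_f - t$ with `scipy.integrate.solve_ivp`, and compares the result with the algebraic Riccati solution. The names `riccati_rhs` and `horizon` are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_are

# Illustrative system and weights (assumptions, not from the text)
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
Qf = np.zeros((2, 2))   # terminal cost, assumed zero here
n = A.shape[0]

def riccati_rhs(s, p_flat):
    """dP/ds in reversed time s = t_f - t:
    dP/ds = P A + A^T P - P B R^{-1} B^T P + Q."""
    P = p_flat.reshape(n, n)
    dP = P @ A + A.T @ P - P @ B @ np.linalg.solve(R, B.T @ P) + Q
    return dP.ravel()

horizon = 20.0  # length of the time horizon t_f
sol = solve_ivp(riccati_rhs, (0.0, horizon), Qf.ravel(),
                rtol=1e-9, atol=1e-11)

P_far = sol.y[:, -1].reshape(n, n)          # P far from the terminal time
P_are = solve_continuous_are(A, B, Q, R)    # algebraic Riccati solution
gap = np.max(np.abs(P_far - P_are))
print("max |P_far - P_are| =", gap)
```

For a long enough horizon the derivative term dies out and the two solutions agree, which is the numerical counterpart of dropping the $\dot{P}$ term above.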


8.5 Optimal Full-State Estimation: the Kalman Filter


The optimal LQR controller from Section 8.4 relies on full-state measurements
of the system. However, full-state measurements may be either prohibitively 
expensive or technologically infeasible to obtain, especially for high-dimensional 
systems. The computational burden of collecting and processing full-state mea- 
surements may also introduce unacceptable time delays that will limit robust 
performance. 

Instead of measuring the full state x, it may be possible to estimate the state 
from limited noisy measurements $y$. In fact, full-state estimation is mathematically
possible as long as the pair $(A, C)$ is observable, although the effectiveness
of estimation depends on the degree of observability as quantified by the
observability Gramian. The Kalman filter is the most commonly
used full-state estimator, as it optimally balances the competing effects of measurement
noise, disturbances, and model uncertainty. As will be shown in the
next section, it is possible to use the full-state estimate from a Kalman filter in
conjunction with the optimal full-state LQR feedback law.

When deriving the optimal full-state estimator, it is necessary to reintroduce
disturbances to the state, $w_d$, and sensor noise, $w_n$:


$$ \frac{d}{dt}x = Ax + Bu + w_d, \tag{8.56a} $$
$$ y = Cx + Du + w_n. \tag{8.56b} $$

The Kalman filter assumes that both the disturbance and noise are zero-mean
Gaussian processes with known covariances:


$$ \mathbb{E}\left(w_d(t)\, w_d(\tau)^*\right) = V_d\,\delta(t - \tau), \tag{8.57a} $$
$$ \mathbb{E}\left(w_n(t)\, w_n(\tau)^*\right) = V_n\,\delta(t - \tau). \tag{8.57b} $$


Here $\mathbb{E}$ is the expected value and $\delta(\cdot)$ is the Dirac delta function. The matrices
$V_d$ and $V_n$ are positive semi-definite with entries containing the covariances
of the disturbance and noise terms. Extensions to the Kalman filter exist for
correlated, biased, and unknown noise and disturbance terms [489] [671].

It is possible to obtain an estimate $\hat{x}$ of the full state $x$ from measurements
of the input $u$ and output $y$, via the following estimator dynamical system:


$$ \frac{d}{dt}\hat{x} = A\hat{x} + Bu + K_f(y - \hat{y}), \tag{8.58a} $$
$$ \hat{y} = C\hat{x} + Du. \tag{8.58b} $$


The matrices $A$, $B$, $C$, and $D$ are obtained from the system model, and the filter
gain $K_f$ is determined via a procedure similar to that used for LQR. Thus $K_f$ is given by


$$ K_f = Y C^* V_n^{-1}, \tag{8.59} $$

where $Y$ is the solution to another algebraic Riccati equation:

$$ Y A^* + A Y - Y C^* V_n^{-1} C Y + V_d = 0. \tag{8.60} $$


This solution is commonly referred to as the Kalman filter, and it is the optimal
full-state estimator with respect to the following cost function:


$$ J = \lim_{t\to\infty} \mathbb{E}\left( (x(t) - \hat{x}(t))^* (x(t) - \hat{x}(t)) \right). \tag{8.61} $$


This cost function implicitly includes the effects of disturbance and noise, which
are required to determine the optimal balance between aggressive estimation
and noise attenuation. Thus, the Kalman filter is referred to as linear-quadratic
estimation (LQE), and has a dual formulation to the LQR optimization. The cost
in (8.61) is computed as an ensemble average over many realizations.

The Kalman filter gain $K_f$ may be determined via


>> Kf = lqe(A,I,C,Vd,Vn); % design Kalman filter gain


where I is the $n \times n$ identity matrix. Optimal control and estimation are mathematical
dual problems, as are controllability and observability, so the Kalman
filter may also be found using LQR:


>> Kf = (lqr(A',C',Vd,Vn))'; % LQR and LQE are dual problems


[Figure 8.10 block diagram: a System block, $\frac{d}{dt}x = Ax + Bu + w_d$, and a Kalman filter block, $\frac{d}{dt}\hat{x} = (A - K_f C)\hat{x} + K_f y + Bu$.]

Figure 8.10: Schematic of the Kalman filter for full-state estimation from noisy
measurements $y = Cx + w_n$, with process noise (disturbances) $w_d$. This diagram
does not have a feedthrough term $D$, although it may be included.


The Kalman filter is shown schematically in Fig. 
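Substituting the output estimate $\hat{y}$ from (8.58b) into (8.58a) yields


$$ \frac{d}{dt}\hat{x} = (A - K_f C)\hat{x} + K_f y + (B - K_f D)u \tag{8.62a} $$
$$ = (A - K_f C)\hat{x} + \begin{bmatrix} K_f & (B - K_f D) \end{bmatrix} \begin{bmatrix} y \\ u \end{bmatrix}. \tag{8.62b} $$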


The estimator dynamical system is expressed in terms of the estimate $\hat{x}$ with
inputs $y$ and $u$. If the system is observable, it is possible to place the eigenvalues
of $A - K_f C$ arbitrarily with choice of $K_f$. When the eigenvalues of the estimator
are stable, then the state estimate $\hat{x}$ converges to the full state $x$ asymptotically,
as long as the model faithfully captures the true system dynamics. To see this
convergence, consider the dynamics of the estimation error $\epsilon = x - \hat{x}$:


$$ \begin{aligned}
\frac{d}{dt}\epsilon &= \frac{d}{dt}x - \frac{d}{dt}\hat{x} \\
&= [Ax + Bu + w_d] - [(A - K_f C)\hat{x} + K_f y + (B - K_f D)u] \\
&= A\epsilon + w_d + K_f C\hat{x} - K_f y + K_f D u \\
&= A\epsilon + w_d + K_f C\hat{x} - K_f \underbrace{[Cx + Du + w_n]}_{y} + K_f D u \\
&= (A - K_f C)\epsilon + w_d - K_f w_n.
\end{aligned} $$


Therefore, the estimate $\hat{x}$ will converge to the true full state when $A - K_f C$
has stable eigenvalues. As with LQR, there is a tradeoff between over-stabilization
of these eigenvalues and the amplification of sensor noise. This is similar to
the behavior of an inexperienced driver who may hold the steering wheel too
tightly and will overreact to every minor bump and disturbance on the road.
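
The gain formula (8.59), the Riccati equation (8.60), and the stability of $A - K_f C$ can be checked numerically in a few lines. The following is a minimal sketch that exploits the duality with LQR by calling `scipy.linalg.solve_continuous_are` with $A^T$ and $C^T$. The helper name `kalman_gain`, the double-integrator model, and the unit covariances are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def kalman_gain(A, C, Vd, Vn):
    """Return (Kf, Y), where Y solves Y A^T + A Y - Y C^T Vn^{-1} C Y + Vd = 0
    and Kf = Y C^T Vn^{-1} (hypothetical helper using LQR/LQE duality)."""
    Y = solve_continuous_are(A.T, C.T, Vd, Vn)
    Kf = Y @ C.T @ np.linalg.inv(Vn)
    return Kf, Y

# Illustrative model (an assumption): double integrator, position measured
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
C = np.array([[1.0, 0.0]])
Vd = np.eye(2)           # disturbance covariance
Vn = np.array([[1.0]])   # noise covariance

Kf, Y = kalman_gain(A, C, Vd, Vn)

# Residual of the estimator Riccati equation (8.60)
residual = Y @ A.T + A @ Y - Y @ C.T @ np.linalg.inv(Vn) @ C @ Y + Vd

# Error dynamics are governed by A - Kf C
estimator_eigs = np.linalg.eigvals(A - Kf @ C)

print("Kf =", Kf.ravel())
print("eigenvalues of A - Kf C:", estimator_eigs)
```

With these weights the hand calculation gives $K_f = [\sqrt{3} \;\; 1]^T$, the transpose-dual of the LQR gain for the same system with $Q = I$ and $R = 1$.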

There are many variants of the Kalman filter for nonlinear systems
[361] [729], including the extended and unscented Kalman filters. The ensemble
Kalman filter is an extension that works well for high-dimensional systems,
such as in geophysical data assimilation [596]. All of these methods still
assume Gaussian noise processes, and the particle filter provides a more general,
although more computationally intensive, alternative that can handle arbitrary
noise distributions [306] [602]. The unscented Kalman filter balances the
efficiency of the Kalman filter and accuracy of the particle filter.


8.6 Optimal Sensor-Based Control: Linear-Quadratic Gaussian (LQG)


The full-state estimate from the Kalman filter is generally used in conjunction
with the full-state feedback control law from LQR, resulting in optimal sensor-based
feedback. Remarkably, the LQR gain $K_r$ and the Kalman filter gain $K_f$
may be designed separately, and the resulting sensor-based feedback will remain
optimal and retain the closed-loop eigenvalues when combined.
Combining the LQR full-state feedback with the Kalman filter full-state estimator
results in the linear-quadratic Gaussian (LQG) controller. The LQG controller
is a dynamical system with input $y$, output $u$, and internal state $\hat{x}$:


$$ \frac{d}{dt}\hat{x} = (A - K_f C - B K_r)\hat{x} + K_f y, \tag{8.63a} $$
$$ u = -K_r \hat{x}. \tag{8.63b} $$

Figure 8.11: Schematic illustrating the linear-quadratic Gaussian (LQG) controller
for optimal closed-loop feedback based on noisy measurements $y$. The
optimal LQR and Kalman filter gain matrices $K_r$ and $K_f$ may be designed independently,
based on two different algebraic Riccati equations. When combined,
the resulting sensor-based feedback remains optimal.

The LQG controller is optimal with respect to the following ensemble-averaged
version of the cost function from (8.44):


$$ J(t) = \left\langle \int_0^t \left[ x(\tau)^* Q\, x(\tau) + u(\tau)^* R\, u(\tau) \right] d\tau \right\rangle. \tag{8.64} $$


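The controller $u = -K_r \hat{x}$ is in terms of the state estimate, and so this cost
function must be averaged over many realizations of the disturbance and noise.
Applying LQR to $\hat{x}$ results in the following state dynamics:


$$ \frac{d}{dt}x = Ax - BK_r\hat{x} + w_d \tag{8.65a} $$
$$ = Ax - BK_r x + BK_r(x - \hat{x}) + w_d \tag{8.65b} $$
$$ = Ax - BK_r x + BK_r\epsilon + w_d. \tag{8.65c} $$


Again $\epsilon = x - \hat{x}$ as before. Finally, the closed-loop system may be written as


$$ \frac{d}{dt}\begin{bmatrix} x \\ \epsilon \end{bmatrix} = \begin{bmatrix} A - BK_r & BK_r \\ 0 & A - K_f C \end{bmatrix} \begin{bmatrix} x \\ \epsilon \end{bmatrix} + \begin{bmatrix} I & 0 \\ I & -K_f \end{bmatrix} \begin{bmatrix} w_d \\ w_n \end{bmatrix}. \tag{8.66} $$
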
Thus, the closed-loop eigenvalues of the LQG-regulated system are given by
the eigenvalues of $A - BK_r$ and $A - K_f C$, which were optimally chosen by the
LQR and Kalman filter gain matrices, respectively.
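
This separation of eigenvalues can be verified numerically. The sketch below builds the noise-free closed loop in the original $(x, \hat{x})$ coordinates, that is, the plant driven by $u = -K_r\hat{x}$ together with the controller (8.63), and compares its characteristic polynomial with the product of those of $A - BK_r$ and $A - K_f C$. The double-integrator model and the weights are illustrative assumptions, and a zero feedthrough term ($D = 0$) is assumed.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative model and weights (assumptions, not from the text)
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
Q, R = np.eye(2), np.array([[1.0]])
Vd, Vn = np.eye(2), np.array([[0.1]])

# LQR gain from (8.45)-(8.46)
X = solve_continuous_are(A, B, Q, R)
Kr = np.linalg.solve(R, B.T @ X)

# Kalman gain from (8.59)-(8.60), using duality
Y = solve_continuous_are(A.T, C.T, Vd, Vn)
Kf = Y @ C.T @ np.linalg.inv(Vn)

# Closed loop in (x, xhat) coordinates without noise:
#   dx/dt    = A x - B Kr xhat
#   dxhat/dt = Kf C x + (A - Kf C - B Kr) xhat
M = np.block([[A, -B @ Kr],
              [Kf @ C, A - Kf @ C - B @ Kr]])

# Characteristic polynomials: full loop vs. product of the two designs
char_full = np.poly(M)
char_split = np.convolve(np.poly(A - B @ Kr), np.poly(A - Kf @ C))

print("eigenvalues of closed loop:", np.linalg.eigvals(M))
```

Comparing polynomial coefficients rather than sorted eigenvalue lists avoids any ambiguity in eigenvalue ordering.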

The LQG framework, shown in Fig. relies on an accurate model of the 
system and knowledge of the magnitudes of the disturbances and measurement
noise, which are assumed to be Gaussian processes. In real-world systems,
each of these assumptions may be invalid, and even small time delays
and model uncertainty may destroy the robustness of LQG and result in instability
[209]. The lack of robustness of LQG regulators to model uncertainty
motivates the introduction of robust control in Section 8.8. For example, it is
possible to robustify LQG regulators through a process known as loop-transfer
recovery. However, despite robustness issues, LQG control is extremely effective
for many systems, and is among the most common control paradigms.

In contrast to classical control approaches, such as proportional-integral-derivative
(PID) control and designing faster inner-loop control and slow outer-loop
control assuming a separation of timescales, LQG is able to handle multiple-input,
multiple-output (MIMO) systems with overlapping timescales and multi-objective
cost functions with no additional complexity in the algorithm or implementation.


8.7 Case Study: Inverted Pendulum on a Cart 


To consolidate the concepts of optimal control, we will implement a stabiliz- 
ing controller for an inverted pendulum on a cart, shown in Fig. The full 
nonlinear dynamics are given by 


$$ \dot{x} = v, \tag{8.67a} $$
$$ \dot{v} = \frac{-m^2 L^2 g \cos(\theta)\sin(\theta) + mL^2\left(mL\omega^2 \sin(\theta) - \delta v\right) + mL^2 u}{mL^2\left(M + m(1 - \cos(\theta)^2)\right)}, \tag{8.67b} $$
$$ \dot{\theta} = \omega, \tag{8.67c} $$
$$ \dot{\omega} = \frac{(m + M) m g L \sin(\theta) - mL\cos(\theta)\left(mL\omega^2 \sin(\theta) - \delta v\right) - mL\cos(\theta)\, u}{mL^2\left(M + m(1 - \cos(\theta)^2)\right)}, \tag{8.67d} $$

where $x$ is the cart position, $v$ is the velocity, $\theta$ is the pendulum angle, $\omega$ is the
angular velocity, $m$ is the pendulum mass, $M$ is the cart mass, $L$ is the pendulum
arm, $g$ is the gravitational acceleration, $\delta$ is a friction damping on the cart,
and $u$ is a control force applied to the cart.

The function pendcart, defined in Code 8.2, may be used to simulate the full
nonlinear system in (8.67).
Code 8.2: [MATLAB] Right-hand side function for inverted pendulum on cart.


function dx = pendcart(x,m,M,L,g,d,u)

Sx = sin(x(3));
Cx = cos(x(3));
D = m*L*L*(M+m*(1-Cx^2));

dx(1,1) = x(2);
dx(2,1) = (1/D)*(-m^2*L^2*g*Cx*Sx + m*L^2*(m*L*x(4)^2*Sx - d*x(2))) + m*L*L*(1/D)*u;
dx(3,1) = x(4);
dx(4,1) = (1/D)*((m+M)*m*g*L*Sx - m*L*Cx*(m*L*x(4)^2*Sx - d*x(2))) - m*L*Cx*(1/D)*u;


Figure 8.12: Schematic of inverted pendulum on a cart. The control forcing acts
to accelerate or decelerate the cart. For this example, we assume the following
parameter values: pendulum mass ($m = 1$), cart mass ($M = 5$), pendulum
length ($L = 2$), gravitational acceleration ($g = -10$), and cart damping ($\delta = 1$).

Code 8.2: [Python] Right-hand side function for inverted pendulum on cart.


import numpy as np

def pendcart(x,t,m,M,L,g,d,uf):
    u = uf(x) # evaluate anonymous function at x
    Sx = np.sin(x[2])
    Cx = np.cos(x[2])
    D = m*L*L*(M+m*(1-Cx**2))

    dx = np.zeros(4)
    dx[0] = x[1]
    dx[1] = (1/D)*(-(m**2)*(L**2)*g*Cx*Sx + m*(L**2)*(m*L*(x[3]**2)*Sx - d*x[1])) + m*L*L*(1/D)*u
    dx[2] = x[3]
    dx[3] = (1/D)*((m+M)*m*g*L*Sx - m*L*Cx*(m*L*(x[3]**2)*Sx - d*x[1])) - m*L*Cx*(1/D)*u

    return dx


There are two fixed points, corresponding to either the pendulum-down
($\theta = 0$) or pendulum-up ($\theta = \pi$) configuration; in both cases, $v = \omega = 0$ for the
fixed point, and the cart position $x$ is a free variable, as the equations do not
depend explicitly on $x$. It is possible to linearize the equations in (8.67) about
either the up or down solutions, yielding the following linearized dynamics:


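$$ \frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -\delta/M & b m g/M & 0 \\ 0 & 0 & 0 & 1 \\ 0 & -b\delta/(ML) & -b(m+M)g/(ML) & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} + \begin{bmatrix} 0 \\ 1/M \\ 0 \\ b/(ML) \end{bmatrix} u, \quad \text{for} \quad \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}, \tag{8.68} $$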

where b = 1 for the pendulum-up fixed point, and b = —1 for the pendulum- 
down fixed point. The system matrices A and B are initialized in Code 
using the values for the constants given in Fig. 


Code 8.3: [MATLAB] Construct system matrices for inverted pendulum on a
cart.


m = 1; M = 5; L = 2; g = -10; d = 1;

b = 1; % Pendulum up (b=1)

A = [0 1 0 0;
    0 -d/M b*m*g/M 0;
    0 0 0 1;
    0 -b*d/(M*L) -b*(m+M)*g/(M*L) 0];
B = [0; 1/M; 0; b*1/(M*L)];


Code 8.3: [Python] Construct system matrices for inverted pendulum on a cart.


m = 1; M = 5; L = 2; g = -10; d = 1

b = 1 # pendulum up (b=1)

A = np.array([[0,1,0,0],
              [0,-d/M,b*m*g/M,0],
              [0,0,0,1],
              [0,-b*d/(M*L),-b*(m+M)*g/(M*L),0]])

B = np.array([0,1/M,0,b/(M*L)]).reshape((4,1))


We may also confirm that the open-loop system is unstable by checking the
eigenvalues of A using eig(A) in MATLAB or np.linalg.eig(A) in Python,
which gives the eigenvalues (rounded to four decimal places)

array([ 0.    , -2.4311, -0.2336,  2.4648])


In the following, we will test for controllability and observability, develop 
full-state feedback (LQR), full-state estimation (Kalman filter), and sensor-based 
feedback (LQG) solutions. 


Full-State Feedback Control of the Cart-Pendulum 


In this section, we will design an LQR controller to stabilize the inverted pendulum
configuration ($\theta = \pi$) assuming full-state measurements, $y = x$. Before
any control design, we must confirm that the system is linearly controllable
with the given A and B matrices. This is accomplished by computing the
rank of the controllability matrix using either rank(ctrb(A,B)) in MATLAB
or numpy.linalg.matrix_rank(ctrb(A,B)) in Python, which returns a rank
of 4. Thus, the pair (A, B) is controllable, since the controllability matrix has full
rank. It is then possible to specify Q and R matrices for the cost function
and design the LQR controller gain matrix K, as in Code 8.4.


Code 8.4: [MATLAB] Design LQR controller to stabilize inverted pendulum on
a cart.


Q = eye(4); % state cost, 4x4 identity matrix
R = 0.0001; % control cost
K = lqr(A,B,Q,R);


Code 8.4: [Python] Design LQR controller to stabilize inverted pendulum on a
cart.


Q = np.eye(4) # state cost, 4x4 identity matrix
R = 0.0001 # control cost

K = lqr(A,B,Q,R)[0]
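
For readers without a control toolbox, the controllability test and the gain computation can be sketched with NumPy and SciPy alone. The example below assembles the controllability matrix $[B \;\; AB \;\; A^2B \;\; A^3B]$ by hand for the pendulum-up linearization and then solves the Riccati equation (8.46) directly. The variable names `ctrb_mat` and `eigs_cl`, and the control cost `R = 1e-4`, are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Parameters and linearized matrices, pendulum-up case (b = 1)
m, M, L, g, d = 1, 5, 2, -10, 1
b = 1
A = np.array([[0, 1, 0, 0],
              [0, -d/M, b*m*g/M, 0],
              [0, 0, 0, 1],
              [0, -b*d/(M*L), -b*(m+M)*g/(M*L), 0]], dtype=float)
B = np.array([[0], [1/M], [0], [b/(M*L)]], dtype=float)

# Controllability matrix [B, AB, A^2 B, A^3 B] and its rank
ctrb_mat = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(4)])
rank = np.linalg.matrix_rank(ctrb_mat)

# LQR gain via the algebraic Riccati equation
Q = np.eye(4)              # state cost
R = np.array([[1e-4]])     # control cost (illustrative assumption)
X = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ X)

# Closed-loop eigenvalues should all have negative real part
eigs_cl = np.linalg.eigvals(A - B @ K)

print("rank of controllability matrix:", rank)
print("closed-loop eigenvalues:", eigs_cl)
```

A small control cost such as this one makes actuation cheap relative to state error, so the resulting gains are large and the closed-loop response is aggressive.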

We may then simulate the closed-loop system response of the full nonlinear
system. We will initialize our simulation slightly off equilibrium, at
$x_0 = [-1 \;\; 0 \;\; \pi + 0.1 \;\; 0]^T$, and we also impose a desired step change in the
reference position of the cart, from $x = -1$ to $x = 1$.


Code 8.5: [MATLAB] Simulate closed-loop inverted pendulum on a cart system.


tspan = 0:.001:10;
x0 = [-1; 0; pi+.1; 0]; % initial condition
wr = [1; 0; pi; 0]; % reference position
u = @(x) -K*(x - wr); % control law
[t,x] = ode45(@(t,x) pendcart(x,m,M,L,g,d,u(x)),tspan,x0);


Figure 8.13: Closed-loop system response of inverted pendulum on a cart stabilized
with an LQR controller. (Axes: state versus time.)


Code 8.5: [Python] Simulate closed-loop inverted pendulum on a cart system.


tspan = np.arange(0,10,0.001)
x0 = np.array([-1,0,np.pi+0.1,0]) # Initial condition
wr = np.array([1,0,np.pi,0]) # Reference position
u = lambda x: -K@(x-wr) # Control law
x = integrate.odeint(pendcart,x0,tspan,args=(m,M,L,g,d,u))


In this code, the actuation is set to 
$$ u = -K(x - w_r), \tag{8.69} $$


where $w_r = [1 \;\; 0 \;\; \pi \;\; 0]^T$ is the reference position. The closed-loop response is
shown in Fig. 

In the above procedure, specifying the system dynamics and simulating the
closed-loop system response is considerably more involved than actually designing
the controller, which amounts to a single function call in MATLAB and
Python. It is also helpful to compare the LQR response to the response obtained
by non-optimal eigenvalue placement. In particular, Fig. 8.14 shows the system
response and cost function for 100 randomly generated sets of stable eigenvalues,
chosen in the interval $[-3.5, -0.5]$. The LQR controller has the lowest
overall cost, as it is chosen to minimize $J$. The code to plot the pendulum-cart
system is provided online.


Figure 8.14: Comparison of LQR controller response and cost function with
other pole placement locations. Bold lines represent the LQR solutions.


Non-Minimum-Phase Systems. It can be seen from the response that, in order
to move from $x = -1$ to $x = 1$, the system initially moves in the wrong
direction. This behavior indicates that the system is a non-minimum-phase system,
which introduces challenges for robust control, as we will soon see. There
are many examples of non-minimum-phase systems in control. For instance,
parallel parking an automobile first involves moving the center of mass of the
car away from the curb before it then moves closer. Other examples include
increasing altitude in an aircraft, where the elevators must first move the center
of mass down to increase the angle of attack on the main wings before lift
increases the altitude. Adding cold fuel to a turbine may also initially drop the
temperature before it eventually increases.
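
One way to see this behavior in the linearized model is to look at the zeros of the transfer function from the force $u$ to the cart position: a zero in the right half-plane is the usual signature of an initial response in the wrong direction. The sketch below is an illustration under stated assumptions rather than the procedure used in the text. It takes the pendulum-up matrices with the parameter values of Fig. 8.12, chooses the output matrix `C` to measure cart position only, and uses `scipy.signal.ss2tf` followed by `np.roots`.

```python
import numpy as np
from scipy.signal import ss2tf

# Pendulum-up linearization (b = 1) with the parameter values of Fig. 8.12
m, M, L, g, d = 1, 5, 2, -10, 1
b = 1
A = np.array([[0, 1, 0, 0],
              [0, -d/M, b*m*g/M, 0],
              [0, 0, 0, 1],
              [0, -b*d/(M*L), -b*(m+M)*g/(M*L), 0]], dtype=float)
B = np.array([[0], [1/M], [0], [b/(M*L)]], dtype=float)
C = np.array([[1.0, 0.0, 0.0, 0.0]])   # output: cart position (assumption)
D = np.array([[0.0]])

# Transfer function numerator and denominator coefficients
num, den = ss2tf(A, B, C, D)

# Remove round-off noise so that np.roots sees the true numerator degree
num = np.where(np.abs(num[0]) < 1e-9, 0.0, num[0])

zeros = np.roots(num)
poles = np.roots(den)
rhp_zeros = zeros[zeros.real > 0]

print("zeros:", zeros)
print("poles:", poles)
```

For these matrices the numerator reduces to $0.2 s^2 - 1.4$, so the zeros sit at $s = \pm\sqrt{7}$ and one of them lies in the right half-plane. Static state feedback moves the poles but leaves these zeros in place, which is consistent with the inverse response seen in the closed-loop simulation.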


Full-State Estimation of the Cart-Pendulum 


Now we turn to the full-state estimation problem based on limited noisy measurements
$y$. For this example, we will develop the Kalman filter for the pendulum-down
condition ($\theta = 0$), since without feedback the system in the pendulum-up
condition will quickly leave the fixed point where the linear model is valid.
When we combine the Kalman filter with LQR in the next example, it will be
possible to control the unstable inverted pendulum configuration. Switching to
the pendulum-down configuration is simple in the code by setting $b = -1$.

Before designing a Kalman filter, we must choose a sensor and test for observability.
If we measure the cart position, $y = x_1$, which corresponds to a
matrix $C = [1 \;\; 0 \;\; 0 \;\; 0]$, then the observability matrix has a full rank of 4.
This may be confirmed using the command rank(obsv(A,C)) in MATLAB or
numpy.linalg.matrix_rank(obsv(A,C)) in Python.

Because the cart position $x_1$ does not appear explicitly in the dynamics, the
system is not fully observable for any measurement that does not include $x_1$.
Thus, it is impossible to estimate the cart position with a measurement of the
pendulum angle. However, if the cart position is not important for the cost
function (i.e., if we only want to stabilize the pendulum, and do not care where
the cart is located), then other choices of sensor will be admissible.
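
The effect of sensor choice on observability can be checked directly. The following sketch builds the observability matrix by stacking $C, CA, CA^2, CA^3$ for the pendulum-down linearization and compares a cart-position sensor with a pendulum-angle sensor. The helper name `obsv_rank` is hypothetical, and the parameter values are those of Fig. 8.12.

```python
import numpy as np

# Pendulum-down linearization (b = -1) with the parameter values of Fig. 8.12
m, M, L, g, d = 1, 5, 2, -10, 1
b = -1
A = np.array([[0, 1, 0, 0],
              [0, -d/M, b*m*g/M, 0],
              [0, 0, 0, 1],
              [0, -b*d/(M*L), -b*(m+M)*g/(M*L), 0]], dtype=float)

def obsv_rank(A, C):
    """Rank of the observability matrix [C; CA; ...; CA^(n-1)] (hypothetical helper)."""
    n = A.shape[0]
    O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    return np.linalg.matrix_rank(O)

C_position = np.array([[1.0, 0.0, 0.0, 0.0]])  # measure cart position x1
C_angle = np.array([[0.0, 0.0, 1.0, 0.0]])     # measure pendulum angle x3

rank_position = obsv_rank(A, C_position)
rank_angle = obsv_rank(A, C_angle)

print("rank with position sensor:", rank_position)
print("rank with angle sensor:", rank_angle)
```

The angle sensor loses one rank because the first column of $A$ is zero: the cart position never feeds back into the other states, so it leaves no trace in a measurement that excludes it.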

Now we design the Kalman filter, specifying disturbance and noise covariances,
in Code 8.6.


Code 8.6: [MATLAB] Code to specify disturbance and noise magnitudes and
develop Kalman filter gain.

Vd = eye(4); % disturbance covariance
Vn = 1; % noise covariance

% Build Kalman filter
[Kf,P,E] = lqe(A,eye(4),C,Vd,Vn); % design Kalman filter
% alternatively, possible to design using "LQR" code
Kf = (lqr(A',C',Vd,Vn))';


Code 8.6: [Python] Code to specify disturbance and noise magnitudes and develop
Kalman filter gain.


Vd = np.eye(4) # disturbance covariance
Vn = 1 # noise covariance

# Build Kalman filter
Kf, P, E = lqe(A,np.eye(4),C,Vd,Vn)


The Kalman filter gain matrix is given by


$$ K_f = [1.9222 \;\; 1.3474 \;\; -0.6182 \;\; -1.8016]^T. $$


To simulate the system and Kalman filter, we must augment the original
system to include disturbance and noise inputs, as in Code 8.7.


Code 8.7: [MATLAB] Augment system inputs with disturbance and noise
terms, and create Kalman filter system.


B_aug = [B eye(4) 0*B]; % [u I*wd 0*wn]
D_aug = [0 0 0 0 0 1]; % D matrix passes noise through

sysC = ss(A,B_aug,C,D_aug); % single-measurement system

% "true" system w/ full-state output, disturbance, no noise
sysTruth = ss(A,B_aug,eye(4),zeros(4,size(B_aug,2)));

sysKF = ss(A-Kf*C,[B Kf],eye(4),0*[B Kf]); % Kalman filter


Code 8.7: [Python] Augment system inputs with disturbance and noise terms,
and create Kalman filter system.


Baug = np.concatenate((B,np.eye(4),np.zeros_like(B)),axis=1) # [u I*wd 0*wn]
Daug = np.array([0,0,0,0,0,1]) # D matrix passes noise through

sysC = ss(A,Baug,C,Daug) # Single-measurement system

# "True" system w/ full-state output, disturbance, no noise
sysTruth = ss(A,Baug,np.eye(4),np.zeros((4,Baug.shape[1])))

BKf = np.concatenate((B,np.atleast_2d(Kf).T),axis=1)
sysKF = ss(A-np.outer(Kf,C),BKf,np.eye(4),np.zeros_like(BKf))

Code 8.8 simulates the system with a single output measurement, including
additive disturbances and noise, and we use this as the input to a Kalman filter
estimator. At times $t = 1$ and $t = 15$, we give the system a large positive and
negative impulse in the actuation, respectively.


Code 8.8: [MATLAB] Simulate system and estimate full state.


%% Estimate linearized system in "down" position
dt = .01;
t = dt:dt:50;

uDIST = sqrt(Vd)*randn(4,size(t,2)); % random disturbance
uNOISE = sqrt(Vn)*randn(size(t)); % random noise
u = 0*t;
u(1/dt) = 20/dt; % positive impulse
u(15/dt) = -20/dt; % negative impulse

u_aug = [u; uDIST; uNOISE]; % input w/ disturbance and noise
[y,t] = lsim(sysC,u_aug,t); % noisy measurement
[xtrue,t] = lsim(sysTruth,u_aug,t); % true state
[xhat,t] = lsim(sysKF,[u; y'],t); % state estimate

Code 8.8: [Python] Simulate system and estimate full state.


## Estimate linearized system in down position: Gantry crane
dt = 0.01
t = np.arange(0,50,dt)

uDIST = np.sqrt(Vd) @ np.random.randn(4,len(t)) # random disturbance
uNOISE = np.sqrt(Vn) * np.random.randn(len(t)) # random noise
u = np.zeros_like(t)
u[100] = 20/dt # positive impulse
u[1500] = -20/dt # negative impulse

# input w/ disturbance and noise:
uAUG = np.concatenate((u.reshape((1,len(u))),uDIST,uNOISE.reshape((1,len(uNOISE))))).T
y,t,_ = lsim(sysC,uAUG,t) # noisy measurement
xtrue,t,_ = lsim(sysTruth,uAUG,t) # true state
xhat,t,_ = lsim(sysKF,np.vstack((u,y)).T,t) # estimate


Figure shows the noisy measurement signal used by the Kalman filter, 


and Fig. 8.16 shows the full noiseless state, with disturbances, along with the
Kalman filter estimate. 


To build intuition, it is recommended that the reader investigate the performance
of the Kalman filter when the model is an imperfect representation
of the simulated dynamics. When combined with full-state control in the next
section, small time delays and changes to the system model may cause fragility.

Figure 8.15: Noisy measurement that is used for the Kalman filter, along with
the underlying noiseless signal and the Kalman filter estimate.

Figure 8.16: The true and Kalman filter estimated states for the pendulum on a
cart system.

Figure 8.17: MATLAB Simulink model for sensor-based LQG feedback control.
(Block labels: cartpend_sim, Scope, Nonlinear ODE for Inverted Pendulum on a Cart,
Step, x' = Ax+Bu, LQR, Kalman Filter, Measurement.)


Sensor-Based Feedback Control of the Cart-Pendulum 


To apply an LQG regulator to the inverted pendulum on a cart, we will simu- 
late the full nonlinear system in Simulink, as shown in Fig. The nonlinear 
dynamics are encapsulated in the block cartpend_sim, and the inputs consist
of the actuation signal $u$ and disturbance $w_d$. We record the full state for performance
analysis, although only noisy measurements $y = Cx + w_n$ and the
actuation signal $u$ are passed to the Kalman filter. The full-state estimate is then
passed to the LQR block, which commands the desired actuation signal. For
this example, we use the following LQR and LQE weighting matrices: $Q = I_{4\times 4}$,
$R = 0.000001$, $V_d = 0.04\, I_{4\times 4}$, and $V_n = 0.0002$.

The system starts near the vertical equilibrium, at $x_0 = [0 \;\; 0 \;\; 3.14 \;\; 0]^T$,
and we command a step in the cart position from x = 0 to x = 1 at t = 10. The 
resulting response is shown in Fig. Despite noisy measurements (Fig. 
and disturbances (Fig. 8.20), the controller is able to effectively track the reference
cart position while stabilizing the inverted pendulum.

Figure 8.18: Output response using LQG feedback control.

Figure 8.19: Noisy measurement used for the Kalman filter, along with the underlying
noiseless signal and the Kalman filter estimate.

Figure 8.20: Disturbance and actuation signals.


8.8 Robust Control and Frequency-Domain Techniques


Until now, we have described control systems in terms of state-space systems 
of ordinary differential equations. This perspective readily lends itself to stabil- 
ity analysis and design via placement of closed-loop eigenvalues. However, in 
a seminal paper by John Doyle in 1978 it was shown that LQG regula- 
tors can have arbitrarily small stability margins, making them fragile to model 
uncertainties, time delays, and other model imperfections. 

Fortunately, a short time after Doyle's famous 1978 paper, a rigorous mathematical
theory was developed to design controllers that promote robustness.
Indeed, this new theory of robust control generalizes the optimal control framework
used to develop LQR/LQG, by incorporating a different cost function
that penalizes worst-case scenario performance.

To understand and design controllers for robust performance, it will be 
helpful to look at frequency-domain transfer functions of various signals. In par- 
ticular, we will consider the sensitivity, complementary sensitivity, and loop 
transfer functions. These enable quantitative and visual approaches to assess 
robust performance, and they enable intuitive and compact representations of 
control systems. 

Robust control is a natural perspective when considering uncertain models 
obtained from noisy or incomplete data. Moreover, it may be possible to man- 
age system nonlinearity as a form of structured model uncertainty. Finally, we 
will discuss known factors that limit robust performance, including time delays 
and non-minimum-phase behavior. 


Frequency-Domain Techniques 


To understand and manage the tradeoffs between robustness and performance 
in a control system, it is helpful to design and analyze controllers using frequency- 
domain techniques. 

The Laplace transform allows us to go between the time domain (state space) 
and frequency domain: 


$$ \mathcal{L}\{f(t)\} = f(s) = \int_{0^-}^{\infty} f(t)\, e^{-st}\, dt. \tag{8.70} $$


Here, s is the complex-valued Laplace variable. The Laplace transform may
be thought of as a one-sided generalized Fourier transform that is valid for
functions that do not converge to zero as $t \to \infty$. The Laplace transform is
particularly useful because it transforms differential equations into algebraic
equations, and convolution integrals in the time domain become simple products
in the frequency domain.

[Footnote on Doyle's 1978 paper. Title: "Guaranteed margins for LQG regulators." Abstract: "There are none."]

To see how time derivatives pass through the Laplace transform, we use
integration by parts:


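$$\mathcal{L}\left\{\frac{d}{dt}f(t)\right\} = \int_{0^-}^{\infty} \frac{d}{dt}f(t)\, e^{-st}\, dt = \Big[f(t)\,e^{-st}\Big]_{t=0^-}^{t=\infty} - \int_{0^-}^{\infty} f(t)\left(-s\,e^{-st}\right) dt$$

$$= -f(0^-) + s\,\mathcal{L}\{f(t)\}.$$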
Thus, for zero initial conditions, L{df/dt} = s f(s). 
Taking the Laplace transform of the control system in (8.10) yields 
sx(s) = Ax(s) + Bu(s), (8.71a) 
y(s) = Cx(s) + Du(s). (8.71b) 


It is possible to solve for x(s) in the first equation, as 
$$(sI - A)\,x(s) = Bu(s) \quad\Longrightarrow\quad x(s) = (sI - A)^{-1}Bu(s). \qquad (8.72)$$


Substituting this into the second equation, we arrive at a mapping from inputs 
u to outputs y: 


$$y(s) = \left[C(sI - A)^{-1}B + D\right]u(s). \qquad (8.73)$$


We define this mapping as the transfer function: 
$$G(s) = \frac{y(s)}{u(s)} = C(sI - A)^{-1}B + D. \qquad (8.74)$$
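As a numerical sanity check of (8.74), the following sketch evaluates C(sI - A)^{-1}B + D at a single complex value of s and compares it with the corresponding rational expression. The matrices are an assumed second-order example (unit mass, damping 1, stiffness 2, with position and velocity as states); the function name `transfer_function` and the test point `s0` are illustrative choices, not taken from the text.

```python
import numpy as np

# Assumed example system: x1 = position, x2 = velocity,
# x2' = -2*x1 - x2 + u, with the position as the measured output.
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

def transfer_function(s, A, B, C, D):
    """Evaluate G(s) = C (sI - A)^{-1} B + D at one complex value s."""
    n = A.shape[0]
    # Solve (sI - A) x = B instead of forming the inverse explicitly.
    x = np.linalg.solve(s * np.eye(n) - A, B)
    return C @ x + D

s0 = 1.0 + 2.0j                                   # arbitrary test point
G_ss = transfer_function(s0, A, B, C, D)[0, 0]    # from the state-space matrices
G_poly = 1.0 / (s0**2 + s0 + 2.0)                 # from the rational form
print(G_ss, G_poly)
```

Solving the linear system rather than inverting sI - A is a common numerical habit; for a 2-by-2 example either approach would do.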


For linear systems, there are three equivalent representations: (1) time do- 
main, in terms of the impulse response; (2) frequency domain, in terms of the 
transfer function; and (3) state space, in terms of a system of differential
equations. These representations are shown schematically in Fig. 8.21. As we will
see, there are many benefits to analyzing control systems in the frequency do- 
main. 


Frequency Response 


The transfer function in (8.74) is particularly useful because it gives rise to the 
frequency response, which is a graphical representation of the control system 
in terms of measurable data. To illustrate this, we will consider a single-input, 
single-output (SISO) system. It is a property of linear systems with zero initial 
conditions that a sinusoidal input will give rise to a sinusoidal output with the 
same frequency, perhaps with a different magnitude $A$ and phase $\phi$:


$$u(t) = \sin(\omega t) \quad\Longrightarrow\quad y(t) = A\sin(\omega t + \phi). \qquad (8.75)$$


Figure 8.21: Three equivalent representations of linear time-invariant (LTI) systems. [Diagram labels: Laplace transform, $G(s) = \mathcal{L}\{g(t)\}$; eigensystem realization; canonical realization (not unique); $G(s) = C(sI - A)^{-1}B + D$.]


This is true for long times, after initial transients die out. The amplitude $A$ and
phase $\phi$ of the output sinusoid depend on the input frequency $\omega$. These
functions $A(\omega)$ and $\phi(\omega)$ may be mapped out by running a number of experiments
with sinusoidal input at different frequencies $\omega$. Alternatively, this information
is obtained from the complex-valued transfer function G(s):


$$A(\omega) = |G(i\omega)|, \qquad \phi(\omega) = \angle G(i\omega). \qquad (8.76)$$


Thus, the amplitude and phase angle for input $\sin(\omega t)$ may be obtained by
evaluating the transfer function at $s = i\omega$ (i.e., along the imaginary axis in the
complex plane). These quantities may then be plotted, resulting in the frequency
response or Bode plot.
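The relationship in (8.76) can be checked by simulation. The sketch below drives an assumed second-order system (unit mass, damping 1, stiffness 2) with a unit sinusoid, waits for the transient to decay, fits the steady-state output to a sinusoid, and compares the fitted amplitude and phase with |G(i omega)| and the angle of G(i omega). The fixed-step Runge-Kutta integrator, the step size, and the choice omega = 1 are illustrative assumptions.

```python
import numpy as np

# Assumed example: x'' = -x' - 2x + u, output y = x.
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([0.0, 1.0])
C = np.array([1.0, 0.0])

omega = 1.0                      # input frequency (rad/s), arbitrary choice
dt = 0.01
t = np.arange(0.0, 60.0 + dt, dt)

def f(x, tk):
    """State derivative for the sinusoidal input u(t) = sin(omega t)."""
    return A @ x + B * np.sin(omega * tk)

# Classical fourth-order Runge-Kutta integration from zero initial conditions.
x = np.zeros(2)
y = np.zeros_like(t)
for k in range(len(t) - 1):
    y[k] = C @ x
    k1 = f(x, t[k])
    k2 = f(x + 0.5 * dt * k1, t[k] + 0.5 * dt)
    k3 = f(x + 0.5 * dt * k2, t[k] + 0.5 * dt)
    k4 = f(x + dt * k3, t[k] + dt)
    x = x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
y[-1] = C @ x

# Least-squares fit y ~ a sin(omega t) + b cos(omega t) on the steady-state tail,
# so that A = sqrt(a^2 + b^2) and phi = atan2(b, a).
tail = t > 40.0
M = np.column_stack([np.sin(omega * t[tail]), np.cos(omega * t[tail])])
a, b = np.linalg.lstsq(M, y[tail], rcond=None)[0]
A_sim, phi_sim = np.hypot(a, b), np.arctan2(b, a)

# Prediction from the transfer function evaluated at s = i*omega.
G_iw = 1.0 / ((1j * omega) ** 2 + 1j * omega + 2.0)
A_pred, phi_pred = np.abs(G_iw), np.angle(G_iw)
print(A_sim, A_pred, np.degrees(phi_sim), np.degrees(phi_pred))
```

Repeating this for a range of frequencies would trace out the same curves that a Bode plot displays.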

For a concrete example, consider the spring-mass-damper system, shown
in Fig. 8.22. The equations of motion are given by


$$m\ddot{x} = -\delta\dot{x} - kx + u. \qquad (8.77)$$
Choosing values $m = 1$, $\delta = 1$, $k = 2$, and taking the Laplace transform yields
$$G(s) = \frac{1}{s^2 + s + 2}. \qquad (8.78)$$


Here we are assuming that the output y is a measurement of the position x of 
the mass. Note that the denominator of the transfer function G(s) is the char- 
acteristic equation of (8.77), written in state-space form. Thus, the poles of the 
complex function G(s) are eigenvalues of the state-space system. 
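A quick numerical check of this statement, assuming the state vector is chosen as position and velocity (one of many possible choices) for the values m = 1, damping 1, and k = 2:

```python
import numpy as np

# State-space matrix for x'' = -x' - 2x + u with states (x, x').
A = np.array([[0.0, 1.0], [-2.0, -1.0]])

poles = np.roots([1.0, 1.0, 2.0])   # roots of the denominator s^2 + s + 2
eigs = np.linalg.eigvals(A)         # eigenvalues of the state-space system

# Order both sets by imaginary part before comparing them.
poles = poles[np.argsort(poles.imag)]
eigs = eigs[np.argsort(eigs.imag)]
print(poles)
print(eigs)
```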
Figure 8.22: Spring-mass-damper system.


Figure 8.23: Frequency response of spring-mass-damper system. The magnitude
is plotted on a logarithmic scale, in units of decibel (dB), and the frequency
is likewise on a log scale. [Bode diagram; axes: magnitude (dB) versus frequency (rad/s).]


It is now possible to create this system and plot the frequency response (i.e.,
the Bode plot), as shown in Fig. 8.23 and computed in Code 8.9. Note that the
frequency response is readily interpretable and provides physical intuition. For
example, the zero slope of the magnitude at low frequencies indicates that slow
forcing translates directly into motion of the mass, while the roll-off of the
magnitude at high frequencies indicates that fast forcing is attenuated and does not
significantly affect the motion of the mass. Moreover, the resonance frequency
is seen as a peak in the magnitude, indicating an amplification of forcing at this
frequency. Code 8.9 also shows how to manipulate state-space realizations into
frequency-domain representations, and vice versa.


Code 8.9: [MATLAB] Create transfer function and plot frequency response
(Bode) plot. Convert between state-space and frequency-domain representations.

s = tf('s');                 % Laplace variable
G = 1/(s^2 + s + 2);         % Transfer function
bode(G);                     % Frequency response

A = [0 1; -2 -1];            % State-space realization
B = [0; 1];
C = [1 0];
D = 0;

% Convert to frequency domain
[num,den] = ss2tf(A,B,C,D);
G = tf(num,den)

% Convert back to state space
[A,B,C,D] = tf2ss(G.num{1},G.den{1})


Code 8.9: [Python] Create transfer function and plot frequency response (Bode)
plot. Convert between state-space and frequency-domain representations.

import numpy as np
from control.matlab import tf, bode, ss2tf, tf2ss

s = tf(np.array([1,0]),np.array([0,1]))   # Laplace variable
G = 1/(s**2 + s + 2)                      # Transfer function
mag, phase, w = bode(G)                   # Frequency response (magnitude, phase, frequency)

# State space realization
A = np.array([[0,1],[-2,-1]])
B = np.array([0,1]).reshape((2,1))
C = np.array([[1,0]])
D = 0

# Convert to frequency domain
G = ss2tf(A,B,C,D)   # returns transfer function

# Convert back to state space
sysSS = tf2ss(G)     # returns state space system


Note that the state-space representation is not unique, and going from a 
transfer function to state space may switch the order of the state variables. 
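The non-uniqueness can be made concrete with a change of coordinates. In the sketch below, a hypothetical invertible matrix P (here a simple swap of the two states) maps one realization (A, B, C, D) into another, and both realizations give the same transfer function value at an arbitrary test point. The example matrices and the names `tf_eval`, `P`, and `s0` are assumptions made for illustration.

```python
import numpy as np

def tf_eval(s, A, B, C, D):
    """Evaluate C (sI - A)^{-1} B + D for a single-input, single-output system."""
    n = A.shape[0]
    return (C @ np.linalg.solve(s * np.eye(n) - A, B) + D)[0, 0]

# Assumed example realization of G(s) = 1/(s^2 + s + 2).
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

# Change of coordinates x = P z that swaps the order of the state variables.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
P_inv = np.linalg.inv(P)
A2, B2, C2 = P_inv @ A @ P, P_inv @ B, C @ P   # transformed realization

s0 = 0.3 + 1.7j                                 # arbitrary test point
g1 = tf_eval(s0, A, B, C, D)
g2 = tf_eval(s0, A2, B2, C2, D)
print(g1, g2)
```

Any invertible P would work here; the swap is chosen because it mirrors the reordering of states mentioned above.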

The frequency domain is also useful because impulsive or step inputs are
particularly simple to represent with the Laplace transform. The impulse response
(Fig. 8.24) and step response (Fig. 8.25) are computed in MATLAB by


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 


8.8. ROBUST CONTROL AND FREQUENCY-DOMAIN TECHNIQUES 


>> impulse(G); % Impulse response
>> step(G); % Step response


and in python-control by


>>> ia,it = impulse(G) # Impulse response
>>> plt.plot(it,ia) # Need to plot output of impulse
>>> ia,it = step(G) # Step response
>>> plt.plot(it,ia) # Need to plot output of step


Figure 8.24: Impulse response of spring-mass-damper system.

Figure 8.25: Step response of spring-mass-damper system.

Figure 8.26: Closed-loop feedback control diagram with reference input, noise,
and disturbance. We will consider the various transfer functions from exogenous
inputs to the error $\epsilon$, thus deriving the loop transfer function, as well as
the sensitivity and complementary sensitivity functions. [Block diagram labels: reference $w_r$, error $\epsilon$, controller $K$, control input $u$, system $G$, disturbance $w_d$, noise $w_n$, feedback signal.]


Performance and the Loop Transfer Function: Sensitivity and 
Complementary Sensitivity 
Consider a slightly modified version of the earlier closed-loop feedback diagram,
in which the disturbance has a model, $P_d$ (denoted $G_d$ in (8.79)). This new
diagram, shown in Fig. 8.26, will be used to derive the
important transfer functions relevant for assessing robust performance:


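$$y = GK(w_r - y - w_n) + G_d w_d \qquad (8.79a)$$
$$\Longrightarrow\quad (I + GK)\,y = GK w_r - GK w_n + G_d w_d \qquad (8.79b)$$
$$\Longrightarrow\quad y = \underbrace{(I + GK)^{-1}GK}_{T}\, w_r - \underbrace{(I + GK)^{-1}GK}_{T}\, w_n + \underbrace{(I + GK)^{-1}}_{S}\, G_d w_d. \qquad (8.79c)$$
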
Here, S is the sensitivity function and T is the complementary sensitivity function. 
We may denote L = GK, the loop transfer function, shown in Fig. 8.27, which is
the open-loop transfer function in the absence of feedback. Both S and T may
be simplified in terms of L:


$$S = (I + L)^{-1}, \qquad (8.80a)$$
$$T = (I + L)^{-1}L. \qquad (8.80b)$$


Figure 8.27: Loop transfer function L along with sensitivity S and complementary
sensitivity T functions. [Bode diagram; axes: magnitude (dB) and phase (deg) versus frequency (rad/s).]


Conveniently, the sensitivity and complementary sensitivity functions must 
add up to the identity: S + T = I. 
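The identity S + T = I, and the typical low- and high-frequency behavior of S and T, can be illustrated on a frequency grid. The sketch below assumes a scalar plant G(s) = 1/(s^2 + s + 2) and a hypothetical proportional-integral controller K(s) = 2(1 + 1/s); neither the controller nor its gains come from the text.

```python
import numpy as np

w = np.logspace(-2, 2, 400)          # frequency grid (rad/s)
s = 1j * w

G = 1.0 / (s**2 + s + 2.0)           # assumed plant
K = 2.0 * (1.0 + 1.0 / s)            # hypothetical PI controller
L = G * K                            # loop transfer function

S = 1.0 / (1.0 + L)                  # sensitivity
T = L / (1.0 + L)                    # complementary sensitivity

identity_error = np.max(np.abs(S + T - 1.0))
print(identity_error)
print(abs(S[0]), abs(T[0]))          # low frequency: |S| small, |T| near 1
print(abs(S[-1]), abs(T[-1]))        # high frequency: |S| near 1, |T| small
```

For this particular loop, the integrator in K gives large loop gain at low frequency, so S is small there, while the roll-off of G makes T small at high frequency.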

In practice, the transfer function from the exogenous inputs to the noiseless
error $\epsilon$ is more useful for design:


$$\epsilon = w_r - y = S w_r + T w_n - S G_d w_d. \qquad (8.81)$$


Thus, we see that the sensitivity and complementary sensitivity functions 
provide the maps from reference, disturbance, and noise inputs to the tracking 
error. Since we desire small tracking error, we may then specify S and T to have 
desirable properties, and ideally we will be able to achieve these specifications 
by designing the loop transfer function L. In practice, we will choose the con- 
troller K with knowledge of the model G so that the loop transfer function has 




CHAPTER 8. LINEAR CONTROL THEORY


beneficial properties in the frequency domain. For example, small gain at high 
frequencies will attenuate sensor noise, since this will result in T being small. 
Similarly, high gain at low frequencies will provide good reference tracking 
performance, as S will be small at low frequencies. However, S and T cannot 
both be small everywhere, since S + T = I, from (8.80), and so these design 
objectives may compete. 

For performance and robustness, we want the maximum peak of S, $M_S = \|S\|_\infty$,
to be as small as possible. From (8.81), it is clear that, in the absence of
noise, feedback control improves performance (i.e., reduces error) for all
frequencies where $|S| < 1$; thus control is effective when $T \approx 1$. As explained in
[p. 37], all real systems will have a range of frequencies where $|S| > 1$, in
which case performance is degraded. Minimizing the peak $M_S$ mitigates the
amount of degradation experienced with feedback at these frequencies,
improving performance. In addition, the minimum distance of the loop transfer
function L to the point $-1$ in the complex plane is given by $M_S^{-1}$. By the Nyquist
stability theorem, the larger this distance, the greater the stability margin of the
closed-loop system, improving robustness. These are the two major reasons to
minimize $M_S$.
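The link between the sensitivity peak and the distance from L to the point -1 can be seen numerically. The sketch below reuses an assumed loop, G(s) = 1/(s^2 + s + 2) with a hypothetical PI controller K(s) = 2(1 + 1/s), and estimates both quantities on a frequency grid; a grid only approximates the true peak.

```python
import numpy as np

w = np.logspace(-2, 2, 2000)                        # frequency grid (rad/s)
s = 1j * w
L = 2.0 * (1.0 + 1.0 / s) / (s**2 + s + 2.0)       # assumed loop transfer function

S = 1.0 / (1.0 + L)
M_S = np.max(np.abs(S))                             # grid estimate of the peak of |S|
dist = np.min(np.abs(L - (-1.0)))                   # closest approach of L(i w) to -1

print(M_S, 1.0 / dist)                              # the two numbers coincide
print(w[np.argmax(np.abs(S))])                      # frequency of the peak
```

Since |S| = 1/|1 + L| pointwise, the largest value of |S| occurs exactly where the Nyquist curve of L passes closest to -1. For this assumed loop the peak exceeds 1, so there is a band of frequencies where feedback amplifies error.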

The controller bandwidth $\omega_B$ is the frequency below which feedback control
is effective. This is a subjective definition. Often, $\omega_B$ is the frequency where
$|S(i\omega)|$ first crosses $-3$ dB from below. We would ideally like the controller
bandwidth to be as large as possible without amplifying sensor noise, which
typically has a high frequency. However, there are fundamental bandwidth
limitations that are imposed for systems that have time delays or right half-plane
zeros [665].
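Under the -3 dB convention described above, the bandwidth can be read off a frequency grid as the first frequency at which |S| reaches 1/sqrt(2). The sketch below does this for an assumed loop, G(s) = 1/(s^2 + s + 2) with a hypothetical PI controller K(s) = 2(1 + 1/s); the grid resolution limits the accuracy of the result.

```python
import numpy as np

w = np.logspace(-3, 2, 5000)                        # frequency grid (rad/s)
s = 1j * w
L = 2.0 * (1.0 + 1.0 / s) / (s**2 + s + 2.0)       # assumed loop transfer function
S_mag = np.abs(1.0 / (1.0 + L))

threshold = 1.0 / np.sqrt(2.0)                      # -3 dB as a linear gain
# argmax on a boolean array returns the first index where the condition holds,
# i.e., the first crossing from below (|S| starts well under the threshold).
idx = np.argmax(S_mag >= threshold)
w_B = w[idx]
print(w_B)
```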


Inverting the Dynamics 


With a model of the form in (8.10) or (8.73), it may be possible to design an
open-loop control law to achieve some desired specification without the use of
measurement-based feedback or feedforward control. For instance, if perfect
tracking of the reference input $w_r$ is desired, under certain circumstances
it may be possible to design a controller by inverting the system dynamics
G, i.e., $K(s) = G^{-1}(s)$. In this case, the transfer function from reference
$w_r$ to output y is given by $GG^{-1} = 1$, so that the output perfectly matches the
reference. However, perfect control is never possible in real-world systems, and
this strategy should be used with caution, since it generally relies on a number
of significant assumptions on the system G. First, effective control based on
inversion requires extremely precise knowledge of G and well-characterized,
predictable disturbances; there is little room for model errors or uncertainties,
as there are no sensor measurements to determine if performance is as expected
and no corrective feedback mechanisms to modify the actuation strategy to
compensate.

For open-loop control using system inversion, G must also be stable. It is 
impossible to fundamentally change the dynamics of a linear system through 
open-loop control, and thus an unstable system cannot be stabilized without 
feedback. Attempting to stabilize an unstable system by inverting the dynam- 
ics will typically have disastrous consequences. For instance, consider the fol- 
lowing unstable system with a pole at s = 5 and a zero at s = —10: 


$$G(s) = \frac{s + 10}{s - 5}.$$
Inverting the dynamics would result in a controller
$$K(s) = \frac{s - 5}{s + 10}.$$


However, if there is even the slightest uncertainty in the model, so that the true
pole is at $5 - \epsilon$, then the open-loop system will be


$$G_{\text{true}}(s)K(s) = \frac{s - 5}{s - 5 + \epsilon}.$$


This system is still unstable, despite the attempted pole cancelation. Moreover, 
the unstable mode is now nearly unobservable. 
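The failed cancelation can be verified with polynomial arithmetic. In the sketch below, the size of the model error (`eps = 1e-3`) is an arbitrary illustrative value; the series connection of the true plant and the inverse-model controller retains a right half-plane pole, with a zero sitting almost on top of it.

```python
import numpy as np

eps = 1e-3                                   # hypothetical error in the pole location

# True plant (s + 10)/(s - (5 - eps)) and inverse-model controller (s - 5)/(s + 10),
# stored as polynomial coefficients in descending powers of s.
num_true, den_true = [1.0, 10.0], [1.0, -(5.0 - eps)]
num_K, den_K = [1.0, -5.0], [1.0, 10.0]

# Series connection: multiply numerators and denominators.
num_OL = np.polymul(num_true, num_K)
den_OL = np.polymul(den_true, den_K)

poles = np.roots(den_OL)
zeros = np.roots(num_OL)
unstable = poles[poles.real > 0]
print(poles, zeros)
print(unstable)                              # the pole near s = 5 is still present
```

The near coincidence of the unstable pole and the controller zero is what makes the unstable mode hard to see from the input-output behavior, even though it is still there.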

In addition to stability, G must not have any time delays or zeros in the right 
half-plane, and it must have the same number of poles as zeros. If G has any 
zeros in the right half-plane, then the inverted controller K will be unstable, 
since it will have right half-plane poles. These systems are called non-minimum- 
phase, and there have been generalizations to dynamic inversion that provide 
bounded inverses to these systems [203]. Similarly, time delays are not
invertible, and if G has more poles than zeros, then the resulting controller will not
be realizable and may have extremely large actuation signals u. There are also
generalizations that provide regularized model inversion, where optimization
schemes are applied with penalty terms added to keep the resulting actuation
signal u bounded. These regularized open-loop controllers are often
significantly more effective, with improved robustness.

Combined, these restrictions on G imply that model-based open-loop con- 
trol should only be used when the system is well behaved, accurately character- 
ized by a model, when disturbances are characterized, and when the additional 
feedback control hardware is unnecessarily expensive. Otherwise, performance 
goals must be modest. Open-loop model inversion is often used in manufac- 
turing and robotics, where systems are well characterized and constrained in a 
standard operating environment. 


Robust Control 


As discussed above, LQG controllers are known to have arbitrarily poor
robustness margins. This is a serious problem in systems such as turbulence control,
neuromechanical systems, and epidemiology, where the dynamics are fraught
with uncertainty and time delays.

Figure 8.2 shows the most general schematic for closed-loop feedback control,
encompassing both optimal and robust control strategies. In the generalized
theory of modern control, the goal is to minimize the transfer function
from exogenous inputs w (reference, disturbances, noise, etc.) to a
multi-objective cost function J (accuracy, actuation cost, time-domain performance,
etc.). Optimal control (e.g., LQR, LQE, LQG) is optimal with respect to the
$\mathcal{H}_2$-norm, a bounded 2-norm on a Hardy space, consisting of stable and strictly
proper transfer functions (meaning gain rolls off at high frequency). Robust
control is similarly optimal with respect to the $\mathcal{H}_\infty$ bounded infinity-norm,
consisting of stable and proper transfer functions (gain does not grow infinite at
high frequencies). The infinity-norm is defined as


$$\|G\|_\infty \triangleq \max_{\omega}\, \sigma_1\left(G(i\omega)\right). \qquad (8.82)$$


Here, $\sigma_1$ denotes the maximum singular value. Since the $\|\cdot\|_\infty$-norm is the
maximum value of the transfer function at any frequency, it is often called a
worst-case scenario norm; therefore, minimizing the infinity-norm provides
robustness to worst-case exogenous inputs. $\mathcal{H}_\infty$ robust controllers are used when
robustness is important. There are many connections between $\mathcal{H}_2$ and $\mathcal{H}_\infty$
control, as they exist within the same framework and simply optimize different
norms. We refer the reader to the excellent reference books expanding on this
theory [222, 665].
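A rough way to see what (8.82) measures is to sample the largest singular value of G(i omega) over a frequency grid and take the maximum. The sketch below does this for an assumed second-order example (unit mass, damping 1, stiffness 2). Gridding only gives a lower-bound estimate of the norm and is not how robust control software computes it; it is meant purely as an illustration, and the helper name `sigma_max` is hypothetical.

```python
import numpy as np

# Assumed example realization of G(s) = 1/(s^2 + s + 2).
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

def sigma_max(w):
    """Largest singular value of G(i w) = C (i w I - A)^{-1} B + D."""
    G = C @ np.linalg.solve(1j * w * np.eye(A.shape[0]) - A, B) + D
    return np.linalg.svd(G, compute_uv=False)[0]

w_grid = np.logspace(-2, 2, 4000)                   # frequency grid (rad/s)
sv = np.array([sigma_max(w) for w in w_grid])

hinf_estimate = sv.max()                            # grid estimate of the infinity-norm
w_peak = w_grid[sv.argmax()]                        # frequency of the worst-case gain
print(hinf_estimate, w_peak)
```

For this single-input, single-output example the largest singular value is simply |G(i omega)|, so the estimate coincides with the height of the resonance peak in the Bode magnitude plot.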

If we let $G_{w \to J}$ denote the transfer function from w to J, then the goal of $\mathcal{H}_\infty$
control is to construct a controller to minimize the infinity-norm: $\min \|G_{w \to J}\|_\infty$.
This is typically difficult, and no analytic closed-form solution exists for the
optimal controller in general. However, there are relatively efficient iterative
methods to find a controller such that $\|G_{w \to J}\|_\infty < \gamma$, as described in [211].
There are numerous conditions and caveats that describe when this method
can be used. In addition, there are computationally efficient algorithms
implemented in both MATLAB and Python, and these methods require relatively
low overhead from the user.

Selecting the cost function J to meet design specifications is a critically im- 
portant part of robust control design. Considerations such as disturbance re- 
jection, noise attenuation, controller bandwidth, and actuation cost may be ac- 
counted for by a weighted sum of the transfer functions S, T, and KS. In the 
mixed sensitivity control problem, various weighting transfer functions are used 
to balance the relative importance of these considerations at various frequency 
ranges. For instance, we may weight S by a low-pass filter and KS by a
high-pass filter, so that disturbance rejection at low frequency is promoted and
control response at high frequency is discouraged. A general cost function may
consist of three weighting filters $F_k$ multiplying S, T, and KS:


$$\left\| \begin{bmatrix} F_1 S \\ F_2 T \\ F_3 KS \end{bmatrix} \right\|_\infty.$$


Another possible robust control design is called $\mathcal{H}_\infty$ loop shaping. This
procedure may be more straightforward than mixed sensitivity synthesis for many
problems. The loop shaping method consists of two major steps. First, a desired
open-loop transfer function is specified based on performance goals and
classical control design. Second, the shaped loop is made robust with respect to a
large class of model uncertainty. Indeed, the procedure of $\mathcal{H}_\infty$ loop shaping
allows the user to design an ideal controller to meet performance specifications,
such as rise-time, bandwidth, settling-time, etc. Typically, a loop shape should
have large gain at low frequency to guarantee accurate reference tracking and
rejection of slow disturbances, low gain at high frequencies to attenuate sensor
noise, and a crossover frequency that ensures desirable bandwidth. The loop
transfer function is then robustified so that there are improved gain and phase
margins.

$\mathcal{H}_2$ optimal control (e.g., LQR, LQE, LQG) has been an extremely popular
control paradigm because of its simple mathematical formulation and its
tunability by user input. However, the advantages of $\mathcal{H}_\infty$ control are being
increasingly realized. Additionally, there are numerous consumer software solutions
that make implementation relatively straightforward. In MATLAB, mixed
sensitivity is accomplished using the mixsyn command in the robust control
toolbox. Similarly, loop shaping is accomplished using the loopsyn command in
the robust control toolbox.


Fundamental Limitations on Robust Performance 


As discussed above, we want to minimize the peaks of S and T to improve ro- 
bustness. Some peakedness is inevitable, and there are certain system charac- 
teristics that significantly limit performance and robustness. Most notably, time 
delays and right half-plane zeros of the open-loop system will limit the effec- 
tive control bandwidth and will increase the attainable lower bound for peaks 
of S and T. This contributes to both degrading performance and decreasing 
robustness. 

Similarly, a system will suffer from robust performance limitations if the
number of poles exceeds the number of zeros by two or more. These
fundamental limitations are quantified in the waterbed integrals, which are so named
because if you push a waterbed down in one location, it must rise in another. 
Thus, there are limits to how much one can push down peaks in S without 
causing other peaks to pop up. 
Time delays are relatively easy to understand, since a time delay $\tau$ will
introduce an additional phase lag of $\tau\omega$ at the frequency $\omega$, limiting how fast the
controller can respond effectively (i.e., bandwidth). Thus, the bandwidth for a
controller with acceptable phase margins is typically $\omega_B < 1/\tau$.
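The effect of the phase lag can be quantified with a small calculation. A pure delay multiplies the loop transfer function by exp(-i omega tau), which has unit gain and phase -omega tau, so the largest tolerable delay is roughly the phase margin divided by the gain crossover frequency. The sketch below computes this for an assumed loop, G(s) = 1/(s^2 + s + 2) with a hypothetical PI controller K(s) = 2(1 + 1/s); it assumes a single gain crossover and uses a frequency grid, so the numbers are approximate.

```python
import numpy as np

w = np.logspace(-2, 2, 20001)                       # frequency grid (rad/s)
s = 1j * w
L = 2.0 * (1.0 + 1.0 / s) / (s**2 + s + 2.0)       # assumed loop transfer function

# Gain crossover: first grid frequency where |L| drops below one.
ic = np.argmax(np.abs(L) < 1.0)
w_c = w[ic]

# Phase margin: how far the phase of L(i w_c) is from -180 degrees (in radians).
phase_margin = np.pi + np.angle(L[ic])

# Delay that consumes the entire phase margin at the crossover frequency.
tau_max = phase_margin / w_c

# With this delay, the loop passes (approximately) through the point -1.
L_delayed = L[ic] * np.exp(-1j * w_c * tau_max)
print(w_c, np.degrees(phase_margin), tau_max, L_delayed)
```

In this example the delay margin is a fraction of a second, so a delay comparable to 1/w_c would already destabilize the loop, in line with the rule of thumb above.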

Following the discussion in [665], these fundamental limitations may be
understood in relation to the limitations of open-loop control based on model
inversion. If we consider high-gain feedback $u = K(w_r - y)$ for a system as in
Fig. 8.26 and (8.81), but without disturbances or noise, we have


$$u = K\epsilon = KS w_r. \qquad (8.83)$$


We may write this in terms of the complementary sensitivity T, by noting that
since $T = I - S$, we have $T = L(I + L)^{-1} = GKS$:


$$u = G^{-1}T w_r. \qquad (8.84)$$


Thus, at frequencies where T is nearly the identity I and control is effective, the 
actuation is effectively inverting G. Even with sensor-based feedback, perfect 
control is unattainable. For example, if G has right half-plane zeros, then the ac- 
tuation signal will become unbounded if the gain K is too aggressive. Similarly, 
limitations arise with time delays and when the number of poles of G exceeds 
the number of zeros, as in the case of open-loop model-based inversion. 

As a final illustration of the limitation of right half-plane zeros, we consider
the case of proportional control $u = K(w_r - y)$ in a SISO system with $G(s) = N(s)/D(s)$.
Here, roots of the numerator N(s) are zeros and roots of the denominator D(s)
are poles. The closed-loop transfer function from reference $w_r$ to output y is
given by


$$\frac{y(s)}{w_r(s)} = \frac{GK}{1 + GK} = \frac{NK/D}{1 + NK/D} = \frac{NK}{D + NK}. \qquad (8.85)$$


For small control gain K, the term NK in the denominator is small, and the 
poles of the closed-loop system are near the poles of G, given by roots of D. 
As K is increased, the NK term in the denominator begins to dominate, and 
closed-loop poles are attracted to the roots of N, which are the open-loop zeros 
of G. Thus, if there are right half-plane zeros of the open-loop system G, then 
high-gain proportional control will drive the system unstable. These effects are 
often observed in the root locus plot from classical control theory. In this way, 
we see that right half-plane zeros will directly impose limitations on the gain 
margin of the controller. 
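The migration of closed-loop poles toward the open-loop zeros can be traced numerically by computing the roots of D + NK for increasing gain. The sketch below uses a hypothetical plant with a right half-plane zero, N(s) = 1 - s and D(s) = s^2 + s + 2; the plant and the list of gains are illustrative assumptions.

```python
import numpy as np

N = np.array([-1.0, 1.0])          # N(s) = 1 - s, zero at s = +1 (right half-plane)
D = np.array([1.0, 1.0, 2.0])      # D(s) = s^2 + s + 2, stable open-loop poles

def closed_loop_poles(K):
    """Roots of D(s) + K N(s), the closed-loop characteristic polynomial."""
    return np.roots(np.polyadd(D, K * N))

for K in [0.1, 0.5, 2.0, 100.0]:
    p = closed_loop_poles(K)
    print(K, p, "stable" if np.all(p.real < 0) else "unstable")
```

For this example the characteristic polynomial is s^2 + (1 - K)s + (2 + K), so the loop loses stability once K exceeds 1, and for large K one pole settles near the zero at s = 1.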
Homework


Exercise 8-1. Give an example of a control system in your daily life. Describe 
the inputs and outputs. What are the system dynamics? What are the control 
objectives? 


Exercise 8-2. This example will explore the optimal control workflow on a ro- 
tary inverted pendulum. 


(a) Derive the equations of motion for a rotary pendulum, where the base of 
the pendulum is mounted to a rotating arm. The control input is a torque 
input to the rotor arm. 


(b) Identify the fixed points of the system and linearize about each fixed 
point. What is the stability of each fixed point? Determine the linear con- 
trollability of each fixed point. 


(c) Design an LQR controller for the pendulum-up configuration assuming
full-state measurements. 


(d) Determine the observability of the pendulum-up configuration if we can- 
not measure the full state, but instead measure the pendulum angle and 
the rotor angle. Similarly, determine the observability if we only measure 
the pendulum angular rate and the rotor angle. Which sensor configura- 
tion is more observable? Pick at least one different sensor set and assess if 
this configuration is observable. 


(e) Assuming a measurement of the pendulum angle and rotor angle, design 
a Kalman filter to estimate the full state. Design for disturbance magni- 
tude 1 x 107? and sensor noise 1 x 107°. Simulate the noisy system and 
compare the Kalman filter estimate with the true state without added 
noise. 


(f) Design an LQG controller for the pendulum-up configuration and demon- 
strate this controller in simulation. How does the controller perform when 
you introduce a small time delay? At what point does the time delay 
cause the system to go unstable? 


Exercise 8-3. Derive the equations of motion for a double pendulum on a cart. 
Repeat each step above for the double pendulum. 

Exercise 8-4. This exercise will design a controller to move a cart with a
pendulum in the down position from one point on a track to another. The goal
will be to move quickly from point x = 0 to point x = 1 while minimizing the
amount the pendulum swings, which corresponds to the problem faced when
designing a controller for a gantry crane. What will make this example more
interesting is that it is a non-minimum-phase system.


First, try using a simple proportional feedback controller, with gain K, where 
the control input is proportional to the error between the state and the goal. 
Plot the poles and zeros of the closed-loop system for a range of K, and explain 
the results. 


Design a full-state LQR controller to track a reference position of the cart, while
minimizing the pendulum swinging. Try different LQR gain matrices, and
compare the step response when setting the reference state from x = 0 to x = 1.


Exercise 8-5. Generate two different state-space realizations for the following 
transfer function: 


$$G(s) = \frac{1}{s^2 + 3s + 2}.$$


Find a coordinate transformation between the states of the two systems. 


Exercise 8-6. This example will explore how an ill-conditioned controllability 
Gramian can affect state-feedback control. 


(a) Create a continuous-time, single-input state-space system that has n = 2 
states and full-state measurements (i.e., C = I). Design the system to be 
technically controllable, yet with one of the directions being much more 
controllable than the other (i.e., 10° times more controllable). Compute 
the controllability Gramian for this system and compute its eigendecom- 
position. Explain what you find. 


(b) Design an LQR controller for this system using weight matrices Q = I 
and R = 1. Simulate the response of the closed-loop system, and explain 
the results. 


(c) Using this controller, initialize the system at thousands of points
randomly sampled from a unit circle where ||x|| = 1, and compute the LQR
cost J for each of these trajectories. Plot the initial conditions on the circle,
color-coded by the cost J. Reconcile this plot with the controllability Gramian
you computed earlier.


Exercise 8-7. For the inverted pendulum on a cart, we will analyze the sensi- 
tivity and robustness of LQG control. Plot the sensitivity and complementary 
sensitivity for this system. What are the limits of robustness? 


Robustify the LQG controller using loop synthesis. Compare the robustness 
before and after. 
For the LQG controller, introduce a small time delay to the system and
characterize what changes in the control response and performance. At what size
time delay does the system go unstable? Determine the units for a
realistic-sized pendulum.
Balanced Models for Control


Many systems of interest are exceedingly high-dimensional, making them dif- 
ficult to characterize. High dimensionality also limits controller robustness due 
to significant computational time delays. For example, for the governing equa- 
tions of fluid dynamics, the resulting discretized equations may have millions 
or billions of degrees of freedom, making them expensive to simulate. Thus, 
significant effort has gone into obtaining reduced-order models that capture 
the most relevant mechanisms and are suitable for feedback control. 

Unlike reduced-order models based on proper orthogonal decomposition
(see Chapters 12 and 13), which order modes based on energy content in the
data, here we will discuss a class of balanced reduced-order models that
employ a different inner product to order modes based on input-output energy.
Thus, only modes that are both highly controllable and highly observable are 
selected, making balanced models ideal for control applications. In this chapter 
we also describe related procedures for model reduction and system identifica- 
tion, depending on whether or not the user starts with a high-fidelity model or 
simply has access to measurement data. 


9.1 Model Reduction and System Identification 


In many nonlinear systems, it is still possible to use linear control techniques. 
For example, in fluid dynamics there are numerous success stories of linear 
model-based flow control [240], for example to delay transition from 
laminar to turbulent flow in a spatially developing boundary layer, to reduce 
skin-friction drag in wall turbulence, and to stabilize the flow past an open cav- 
ity. However, many linear control approaches do not scale well to large state 
spaces, and they may be prohibitively expensive to enact for real-time control 
on short timescales. Thus, it is often necessary to develop low-dimensional ap- 
proximations of the system for use in real-time feedback control. 

There are two broad approaches to obtain reduced-order models (ROMs).
First, it is possible to start with a high-dimensional system, such as the
discretized Navier-Stokes equations, and project the dynamics onto a low-dimensional
subspace identified, for example, using proper orthogonal decomposition (POD;
see Chapters 12 and 13) and Galerkin projection [577]. There are numerous
variations to this procedure, including the discrete empirical interpolation method
(DEIM; Section 13.5) [171, 556], gappy POD (Section 13.1) [239], balanced proper
orthogonal decomposition (BPOD; Section 9.2) [608, 755], and many more. The
second approach is to collect data from a simulation or an experiment and
identify a low-rank model using data-driven techniques. This approach is typically
called system identification, and is often preferred for control design because of
the relative ease of implementation. Examples include the dynamic mode
decomposition (DMD) [422, 611, 635, 727], the eigensystem realization
algorithm (ERA) [468], the observer Kalman filter identification
(OKID; Section 9.3) [85718591563], NARMAX [86], and the sparse identification
of nonlinear dynamics (SINDy; Section 7.3).

After a linear model has been identified, either by model reduction or sys- 
tem identification, it may then be used for model-based control design. How- 
ever, there are a number of issues that may arise in practice, as linear model- 
based control might not work for a large class of systems. First, the system be- 
ing modeled may be strongly nonlinear, in which case the linear approximation 
might only capture a small portion of the dynamic effects. Next, the system may 
be stochastically driven, so that the linear model will average out the relevant 
fluctuations. Finally, when control is applied to the full system, the attractor 
dynamics may change, rendering the linearized model invalid. Exceptions in- 
clude the stabilization of fixed points, where feedback control rejects nonlinear 
disturbances and keeps the system in a neighborhood of the fixed point where 
the linearized model is accurate. There are also methods for system identifica- 
tion and model reduction that are nonlinear, involve stochasticity, and change 
with the attractor. However, these methods are typically advanced and they 
also may limit the available machinery from control theory. 


9.2 Balanced Model Reduction 


The high dimensionality and short timescales associated with complex systems 
may render the model-based control strategies described in Chapter 8 infeasible
for real-time applications. Moreover, obtaining $\mathcal{H}_2$ and
$\mathcal{H}_\infty$ optimal controllers may be computationally intractable, as
they involve either solving a
high-dimensional Riccati equation, or an expensive iterative optimization. As 
has been demonstrated throughout this book, even if the ambient dimension 
is large, there may still be a few dominant coherent structures that characterize
the system. Reduced-order models provide efficient, low-dimensional
representations of these most relevant mechanisms. Low-order models may then
be used to design efficient controllers that can be applied in real time, even for
high-dimensional systems. An alternative is to develop controllers based on the
full-dimensional model and then apply model reduction techniques directly to
the full controller [537].

Model reduction is essentially data reduction that respects the fact that the 
data is generated by a dynamic process. If the dynamical system is a linear 
time-invariant (LTI) input-output system, then there is a wealth of machinery 
available for model reduction, and performance bounds may be quantified. The 
techniques explored here are based on the singular value decomposition (SVD;
Chapter 1) [144, 285, 286], and the minimal realization theory of Ho and Kalman
[329, 509]. The general idea is to determine a hierarchical modal decomposition
of the system state that may be truncated at some model order, only keeping 
the coherent structures that are most important for control. 


The Goal of Model Reduction 
Consider a high-dimensional system, depicted schematically in Fig. 9.1,

$$\frac{d}{dt}x = Ax + Bu, \tag{9.1a}$$

$$y = Cx + Du, \tag{9.1b}$$

for example, from a spatially discretized simulation of a partial differential
equation (PDE). The primary goal of model reduction is to find a coordinate
transformation $x = \Psi\tilde{x}$ giving rise to a related system
$(\tilde{A}, \tilde{B}, \tilde{C}, D)$ with similar input-output characteristics,

$$\frac{d}{dt}\tilde{x} = \tilde{A}\tilde{x} + \tilde{B}u, \tag{9.2a}$$

$$y = \tilde{C}\tilde{x} + Du, \tag{9.2b}$$

in terms of a state $\tilde{x} \in \mathbb{R}^r$ with reduced dimension, $r \ll n$.
Note that $u$ and $y$ are the same in (9.1) and (9.2) even though the system states
are different. Obtaining the projection operator $\Psi$ will be the focus of this
section.

As a motivating example, consider the following simplified model: 


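$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -2 & 0 \\ 0 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 10^{-10} \end{bmatrix} u, \tag{9.3a}$$

$$y = \begin{bmatrix} 1 & 10^{-10} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. \tag{9.3b}$$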


In this case, the state $x_2$ is barely controllable and barely observable. Simply
choosing $\tilde{x} = x_1$ will result in a reduced-order model that faithfully
captures the input-output dynamics. Although the choice $\tilde{x} = x_1$ seems
intuitive in this extreme case, many model reduction techniques would erroneously
favor the state $\tilde{x} = x_2$, since it is more lightly damped. Throughout this
section, we will investigate how to accurately and efficiently find the
transformation matrix $\Psi$ that best captures the input-output dynamics.

Figure 9.1: Input-output system. A control-oriented reduced-order model will
capture the transfer function from $u$ to $y$.

The proper orthogonal decomposition (POD) from Chapter 12 provides
a transform matrix $\Psi$, the columns of which are modes that are ordered based
on energy content.¹ POD has been widely used to generate ROMs of complex
systems, many for control, and it is guaranteed to provide an optimal low- 
rank basis to capture the maximal energy or variance in a data set. However, 
it may be the case that the most energetic modes are nearly uncontrollable or 
unobservable, and therefore may not be relevant for control. Similarly, in many 
cases the most controllable and observable state directions may have very low 
energy; for example, acoustic modes typically have very low energy, yet they 
mediate the dominant input-output dynamics in many fluid systems. The rud- 
der on a ship provides a good analogy: although it accounts for a small amount 
of the total energy, it is dynamically important for control. 

Instead of ordering modes based on energy, it is possible to determine a 
hierarchy of modes that are most controllable and observable, therefore captur- 
ing the most input-output information. These modes give rise to balanced mod- 
els, giving equal weighting to the controllability and observability of a state 
via a coordinate transformation that makes the controllability and observabil- 
ity Gramians equal and diagonal. These models have been extremely success- 
ful, although computing a balanced model using traditional methods is pro- 
hibitively expensive for high-dimensional systems. In this section, we describe 
the balancing procedure, as well as modern methods for efficient computation 
of balanced models. A computationally efficient suite of algorithms for model 
reduction and system identification may be found in [71]. 

A balanced reduced-order model should map inputs to outputs as faithfully
as possible for a given model order $r$. It is therefore important to introduce an
operator norm to quantify how similarly (9.1) and (9.2) act on a given set of
inputs. Typically, we take the infinity-norm of the difference between the transfer
functions $G(s)$ and $G_r(s)$ obtained from the full system (9.1) and reduced
system (9.2), respectively. This norm is given by

$$\|G\|_\infty \triangleq \max_{\omega} \sigma_1\left(G(i\omega)\right). \tag{9.4}$$

¹ When the training data consists of velocity fields, for example from a
high-dimensional discretized fluid system, then the singular values literally
indicate the kinetic energy content of the associated mode. It is common to refer
to POD modes as being ordered by energy content, even in other applications,
although variance is more technically correct.


See Section for a primer on transfer functions. To summarize, we seek a 
reduced-order model (9.2) of low order, $r \ll n$, so that the operator norm
$\|G - G_r\|_\infty$ is small.


Change of Variables in Control Systems 


The balanced model reduction problem may be formulated in terms of first 
finding a coordinate transformation, 


x = Tz, (9.5) 


that hierarchically orders the states in z in terms of their ability to capture the 
input-output characteristics of the system. We will begin by considering an
invertible transformation $T \in \mathbb{R}^{n \times n}$, and then provide a method
to compute just the first $r$ columns, which will comprise the transformation
$\Psi$ in (9.2). Thus, it will be possible to retain only the first $r$ most
controllable/observable states,
while truncating the rest. This is similar to the change of variables into eigen- 
vector coordinates in (8.18), except that we emphasize controllability and ob- 
servability rather than characteristics of the dynamics. 
Substituting $Tz$ into (9.1) gives

$$\frac{d}{dt}Tz = ATz + Bu, \tag{9.6a}$$

$$y = CTz + Du. \tag{9.6b}$$

Finally, multiplying (9.6a) by $T^{-1}$ yields

$$\frac{d}{dt}z = T^{-1}ATz + T^{-1}Bu, \tag{9.7a}$$

$$y = CTz + Du. \tag{9.7b}$$

This results in the following transformed equations:

$$\frac{d}{dt}z = \hat{A}z + \hat{B}u, \tag{9.8a}$$

$$y = \hat{C}z + Du, \tag{9.8b}$$

where $\hat{A} = T^{-1}AT$, $\hat{B} = T^{-1}B$, and $\hat{C} = CT$. Note that when
the columns of $T$ are orthonormal, the change of coordinates becomes

$$\frac{d}{dt}z = T^{*}ATz + T^{*}Bu, \tag{9.9a}$$

$$y = CTz + Du. \tag{9.9b}$$


Gramians and Coordinate Transformations 


The controllability and observability Gramians each establish an inner prod- 
uct on state space in terms of how controllable or observable a given state is, 
respectively. As such, Gramians depend on the particular choice of coordinate 
system and will transform under a change of coordinates. In the coordinate 
system z given by (9.5), the controllability Gramian becomes 


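\begin{align}
\hat{W}_c &= \int_0^\infty e^{\hat{A}\tau}\hat{B}\hat{B}^{*}e^{\hat{A}^{*}\tau}\,d\tau \tag{9.10a} \\
&= \int_0^\infty e^{T^{-1}AT\tau}\,T^{-1}BB^{*}T^{-*}\,e^{T^{*}A^{*}T^{-*}\tau}\,d\tau \tag{9.10b} \\
&= \int_0^\infty T^{-1}e^{A\tau}T\,T^{-1}BB^{*}T^{-*}\,T^{*}e^{A^{*}\tau}T^{-*}\,d\tau \tag{9.10c} \\
&= T^{-1}\left(\int_0^\infty e^{A\tau}BB^{*}e^{A^{*}\tau}\,d\tau\right)T^{-*} \tag{9.10d} \\
&= T^{-1}W_cT^{-*}. \tag{9.10e}
\end{align}

Note that here we introduce $T^{-*} := (T^{-1})^{*} = (T^{*})^{-1}$. The
observability Gramian transforms similarly:

$$\hat{W}_o = T^{*}W_oT, \tag{9.11}$$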


which is an exercise for the reader. Both Gramians transform as tensors (i.e., 
in terms of the transform matrix T and its transpose, rather than T and its in- 
verse), which is consistent with them inducing an inner product on state space. 


Simple Rescaling 


This example, modified from Moore [509], demonstrates the ability to balance 
a system through a change of coordinates. Consider the system 


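$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 0 & -10 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 10^{-3} \\ 10^{3} \end{bmatrix} u, \tag{9.12a}$$

$$y = \begin{bmatrix} 10^{3} & 10^{-3} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. \tag{9.12b}$$

In this example, the first state $x_1$ is barely controllable, while the second
state is barely observable. However, under the change of coordinates
$z_1 = 10^{3}x_1$ and $z_2 = 10^{-3}x_2$, the system becomes balanced:

$$\frac{d}{dt}\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 0 & -10 \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \end{bmatrix} u, \tag{9.13a}$$

$$y = \begin{bmatrix} 1 & 1 \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}. \tag{9.13b}$$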


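In this example, the coordinate change simply rescales the state x. For instance,
it may be that the first state had units of millimeters while the second state had
units of kilometers. Writing both states in meters equalizes the controllability
and observability Gramians, so that neither state is artificially favored. A pure
rescaling of this kind does not remove the off-diagonal entries of the Gramians;
making them diagonal as well as equal requires the general balancing
transformation derived below.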
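
The transformation rules (9.10e) and (9.11) can be checked numerically on this rescaling example. The following is a minimal sketch, assuming SciPy is available and assuming the system matrices $A = \mathrm{diag}(-1, -10)$, $B = [10^{-3}, 10^{3}]^T$, and $C = [10^{3}, 10^{-3}]$; the helper name `gramians` is hypothetical. It compares the Gramians obtained by transforming $W_c$ and $W_o$ with the Gramians computed directly in the $z$ coordinates.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def gramians(A, B, C):
    """Continuous-time Gramians from the Lyapunov equations (hypothetical helper)."""
    Wc = solve_continuous_lyapunov(A, -B @ B.T)    # A Wc + Wc A* + B B* = 0
    Wo = solve_continuous_lyapunov(A.T, -C.T @ C)  # A* Wo + Wo A + C* C = 0
    return Wc, Wo

# Assumed system matrices for the rescaling example
A = np.diag([-1.0, -10.0])
B = np.array([[1e-3], [1e3]])
C = np.array([[1e3, 1e-3]])

# x = T z, with z1 = 1e3 * x1 and z2 = 1e-3 * x2
T = np.diag([1e-3, 1e3])
Ti = np.linalg.inv(T)

Wc, Wo = gramians(A, B, C)
Wc_hat = Ti @ Wc @ Ti.T   # transformation rule (9.10e)
Wo_hat = T.T @ Wo @ T     # transformation rule (9.11)

# Gramians computed directly for the transformed system (T^-1 A T, T^-1 B, C T)
Wc_z, Wo_z = gramians(Ti @ A @ T, Ti @ B, C @ T)

print(Wc_hat)
print(Wo_hat)
```

In this sketch the two routes agree, and the rescaled Gramians are equal to one another. Their off-diagonal entries are nonzero for the assumed $A$, which is why a further change of coordinates is needed to make them diagonal.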


Balancing Transformations 


Now we are ready to derive the balancing coordinate transformation T that 
makes the controllability and observability Gramians equal and diagonal: 
$$\hat{W}_c = \hat{W}_o = \Sigma. \tag{9.14}$$

First, consider the product of the Gramians from (9.10) and (9.11):

$$\hat{W}_c\hat{W}_o = T^{-1}W_cW_oT. \tag{9.15}$$

Plugging in the desired $\hat{W}_c = \hat{W}_o = \Sigma$ yields

$$T^{-1}W_cW_oT = \Sigma^2 \quad\Longrightarrow\quad W_cW_oT = T\Sigma^2. \tag{9.16}$$

The latter expression in (9.16) is the equation for the eigendecomposition of
$W_cW_o$, the product of the Gramians in the original coordinates. Thus, the
balancing transformation $T$ is related to the eigendecomposition of $W_cW_o$. The
expression above is valid for any scaling of the eigenvectors, and the correct
rescaling must be chosen to exactly balance the Gramians. In other words, there
are many such transformations $T$ that make the product
$\hat{W}_c\hat{W}_o = \Sigma^2$ but where the individual Gramians are not equal
(for example, diagonal Gramians $\hat{W}_c = \Sigma_c$ and $\hat{W}_o = \Sigma_o$
will satisfy (9.16) if $\Sigma_c\Sigma_o = \Sigma^2$).
Below, we will introduce the matrix $S = T^{-1}$ to simplify notation.


Scaling Eigenvectors for the Balancing Transformation 


To find the correct scaling of eigenvectors to make
$\hat{W}_c = \hat{W}_o = \Sigma$, first consider the simplified case of balancing
the first diagonal element of $\Sigma$. Let $\xi_u$ denote the unscaled first
column of $T$, and let $\eta_u$ denote the unscaled first row of $S = T^{-1}$. Then

$$\eta_uW_c\eta_u^{*} = \sigma_c, \tag{9.17a}$$

$$\xi_u^{*}W_o\xi_u = \sigma_o. \tag{9.17b}$$

The first element of the diagonalized controllability Gramian is thus $\sigma_c$,
while the first element of the diagonalized observability Gramian is $\sigma_o$.
If we scale the eigenvector $\xi_u$ by $\sigma_s$, then the inverse eigenvector
$\eta_u$ is scaled by $\sigma_s^{-1}$. Transforming via the new scaled eigenvectors
$\xi_s = \sigma_s\xi_u$ and $\eta_s = \sigma_s^{-1}\eta_u$ yields

$$\eta_sW_c\eta_s^{*} = \sigma_s^{-2}\sigma_c, \tag{9.18a}$$

$$\xi_s^{*}W_o\xi_s = \sigma_s^{2}\sigma_o. \tag{9.18b}$$

Thus, for the two Gramians to be equal,

$$\sigma_s^{-2}\sigma_c = \sigma_s^{2}\sigma_o \quad\Longrightarrow\quad \sigma_s = \left(\frac{\sigma_c}{\sigma_o}\right)^{1/4}. \tag{9.19}$$

To balance every diagonal entry of the controllability and observability
Gramians, we first consider the unscaled eigenvector transformation $T_u$ from
(9.16); the subscript $u$ simply denotes unscaled. As an example, we use the
standard scaling in most computational software so that the columns of $T_u$ have
unit norm. Then both Gramians are diagonalized, but are not necessarily equal:

$$T_u^{-1}W_cT_u^{-*} = \Sigma_c, \tag{9.20a}$$

$$T_u^{*}W_oT_u = \Sigma_o. \tag{9.20b}$$

The scaling that exactly balances these Gramians is then given by
$\Sigma_s = \Sigma_c^{1/4}\Sigma_o^{-1/4}$. Thus, the exact balancing
transformation is given by

$$T = T_u\Sigma_s. \tag{9.21}$$

It is possible to directly confirm that this transformation balances the Gramians:

$$(T_u\Sigma_s)^{-1}W_c(T_u\Sigma_s)^{-*} = \Sigma_s^{-1}T_u^{-1}W_cT_u^{-*}\Sigma_s^{-1} = \Sigma_s^{-1}\Sigma_c\Sigma_s^{-1} = \Sigma_c^{1/2}\Sigma_o^{1/2}, \tag{9.22a}$$

$$(T_u\Sigma_s)^{*}W_o(T_u\Sigma_s) = \Sigma_sT_u^{*}W_oT_u\Sigma_s = \Sigma_s\Sigma_o\Sigma_s = \Sigma_c^{1/2}\Sigma_o^{1/2}. \tag{9.22b}$$

The manipulations above rely on the fact that diagonal matrices commute, so
that $\Sigma_c\Sigma_o = \Sigma_o\Sigma_c$, etc.

Example of the Balancing Transform and Gramians


Before confronting the practical challenges associated with accurately and ef- 
ficiently computing the balancing transformation, it is helpful to consider an 
illustrative example. 

In MATLAB, computing the balanced system and the balancing transformation
is a simple one-line command:

[sysb,g,Ti,T] = balreal(sys); % Balance the system

In this code, T is the transformation, Ti is the inverse transformation, sysb is
the balanced system, and g is a vector containing the diagonal elements of the
balanced Gramians.

In Python, computing the balanced system is also simple using the Python
Control Systems Library (python-control):²

sysb = balred(sys,len(B)) # Balance the system

The following example illustrates the balanced realization for a two-dimensional
system. First, we generate a system and compute its balanced realization, along
with the Gramians for each system. Next, we visualize the Gramians of the
unbalanced and balanced systems in Fig. 9.2.


Code 9.1: [MATLAB] Obtaining a balanced realization.

A = [-.75 1; -.3 -.75];
B = [2; 1];
C = [1 2];
D = 0;

sys = ss(A,B,C,D);

Wc = gram(sys,'c'); % Controllability Gramian
Wo = gram(sys,'o'); % Observability Gramian

[sysb,g,Ti,T] = balreal(sys); % Balance the system

BWc = gram(sysb,'c') % Balanced Gramians
BWo = gram(sysb,'o')


Code 9.1: [Python] Obtaining a balanced realization.

import numpy as np
from control.matlab import * # Code will resemble Matlab
import slycot

A = np.array([[-0.75,1],[-0.3,-0.75]])
B = np.array([2,1]).reshape((2,1))
C = np.array([1,2])
D = 0

sys = ss(A,B,C,D)

Wc = gram(sys,'c') # Controllability Gramian
Wo = gram(sys,'o') # Observability Gramian

sysb = balred(sys,len(B)) # Balance the system

BWc = gram(sysb,'c') # Balanced Gramians
BWo = gram(sysb,'o')


Figure 9.2: Illustration of balancing transformation on Gramians. The reachable
set with unit control input is shown in red, given by $W_c^{1/2}x$ for
$\|x\| = 1$. The corresponding observable set is shown in blue. Under the
balancing transformation $T$, the Gramians are equal, shown in purple.

² The Python control toolbox is available at https://python-control.readthedocs.


The resulting balanced Gramians are equal, diagonal, and ordered from most
controllable/observable mode to least:

>>BWc =
    1.9439   -0.0000
   -0.0000    0.3207

>>BWo =
    1.9439    0.0000
    0.0000    0.3207


To visualize the Gramians in Fig. 9.2, we first recall that the distance the
system can go in a direction $x$ with a unit actuation input is given by
$x^{*}W_cx$. Thus, the controllability Gramian may be visualized by plotting
$W_c^{1/2}x$ for $x$ on a sphere with $\|x\| = 1$. The observability Gramian may
be similarly visualized.

In this example, we see that the most controllable and observable directions 
may not be well aligned. However, by a change of coordinates, it is possible to 
find a new direction that is the most jointly controllable and observable. It is 
then possible to represent the system in this one-dimensional subspace, while 
still capturing a significant portion of the input-output energy. If the red and 
blue Gramians were exactly perpendicular, so that the most controllable di- 
rection was the least observable direction, and vice versa, then the balanced 
Gramian would be a circle. In this case, there is no preferred state direction, 
and both directions are equally important for the input-output behavior. 

Instead of using the balreal command, it is possible to manually construct
the balancing transformation from the eigendecomposition of $W_cW_o$, as
described above and provided in code available online.
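
The manual construction can be sketched in a few lines of Python. This is an illustrative sketch rather than the code distributed with the book: it assumes SciPy is available, uses the two-state system from Code 9.1, and the function name `balancing_transformation` is hypothetical. It follows (9.16) and (9.20) to (9.21): compute the eigenvectors $T_u$ of $W_cW_o$, diagonalize each Gramian, and rescale the columns by $\Sigma_s = \Sigma_c^{1/4}\Sigma_o^{-1/4}$.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def balancing_transformation(A, B, C):
    """Balancing transformation T from the eigendecomposition of Wc @ Wo."""
    Wc = solve_continuous_lyapunov(A, -B @ B.T)    # controllability Gramian
    Wo = solve_continuous_lyapunov(A.T, -C.T @ C)  # observability Gramian

    # Unscaled eigenvectors T_u of Wc Wo, as in (9.16), sorted by eigenvalue
    evals, Tu = np.linalg.eig(Wc @ Wo)
    order = np.argsort(-evals.real)
    Tu = np.real(Tu[:, order])
    Tu_inv = np.linalg.inv(Tu)

    # Diagonalized (but not yet equal) Gramians, as in (9.20)
    sigma_c = np.diag(Tu_inv @ Wc @ Tu_inv.T)
    sigma_o = np.diag(Tu.T @ Wo @ Tu)

    # Column scaling Sigma_s = Sigma_c^(1/4) Sigma_o^(-1/4), as in (9.21)
    sigma_s = (sigma_c / sigma_o) ** 0.25
    return Tu * sigma_s, Wc, Wo

# Two-state system from Code 9.1
A = np.array([[-0.75, 1.0], [-0.3, -0.75]])
B = np.array([[2.0], [1.0]])
C = np.array([[1.0, 2.0]])

T, Wc, Wo = balancing_transformation(A, B, C)
Ti = np.linalg.inv(T)
BWc = Ti @ Wc @ Ti.T   # balanced controllability Gramian
BWo = T.T @ Wo @ T     # balanced observability Gramian
print(np.round(BWc, 4))
print(np.round(BWo, 4))
```

The eigendecomposition route is convenient for small systems like this one. For large or poorly conditioned problems, factorization-based approaches are generally preferred, which motivates the snapshot-based methods later in this section.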


Balanced Truncation 


We have now shown that it is possible to define a change of coordinates so that 
the controllability and observability Gramians are equal and diagonal. More- 
over, these new coordinates may be ranked hierarchically in terms of their joint 
controllability and observability. It may be possible to truncate these coordi- 
nates and keep only the most controllable/observable directions, resulting in a 
reduced-order model that faithfully captures input-output dynamics. 

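Given the new coordinates $z = T^{-1}x \in \mathbb{R}^n$, it is possible to define
a reduced-order state $\tilde{x} \in \mathbb{R}^r$ as

$$z = \begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix}, \qquad \tilde{x} = \begin{bmatrix} z_1 & \cdots & z_r \end{bmatrix}^T, \qquad z_t = \begin{bmatrix} z_{r+1} & \cdots & z_n \end{bmatrix}^T, \tag{9.23}$$

in terms of the first $r$ most controllable and observable directions. If we
partition the balancing transformation $T$ and inverse transformation
$S = T^{-1}$ into the first $r$ modes to be retained and the last $n - r$ modes to
be truncated,

$$T = \begin{bmatrix} \Psi & T_t \end{bmatrix}, \qquad S = \begin{bmatrix} \Phi^{*} \\ S_t \end{bmatrix}, \tag{9.24}$$

then it is possible to rewrite the transformed dynamics in (9.7) as

$$\frac{d}{dt}\begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix} = \begin{bmatrix} \Phi^{*}A\Psi & \Phi^{*}AT_t \\ S_tA\Psi & S_tAT_t \end{bmatrix}\begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix} + \begin{bmatrix} \Phi^{*}B \\ S_tB \end{bmatrix} u, \tag{9.25a}$$

$$y = \begin{bmatrix} C\Psi & CT_t \end{bmatrix}\begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix} + Du. \tag{9.25b}$$

In balanced truncation, the state $z_t$ is simply truncated (i.e., discarded and
set equal to zero), and only the $\tilde{x}$ equations remain:

$$\frac{d}{dt}\tilde{x} = \Phi^{*}A\Psi\tilde{x} + \Phi^{*}Bu, \tag{9.26a}$$

$$y = C\Psi\tilde{x} + Du. \tag{9.26b}$$

Only the first $r$ columns of $T$ and of $S^{*} = T^{-*}$ are required to construct
$\Psi$ and $\Phi$, and thus computing the entire balancing transformation $T$ is
unnecessary. Note that the matrix $\Phi$ here is different than the matrix of DMD
modes in Section 7.2. The computation of $\Psi$ and $\Phi$ without $T$ will be
discussed in the following sections. A key benefit of balanced truncation is the
existence of upper and lower bounds on the error of a given order truncation:

$$\text{upper bound:}\quad \|G - G_r\|_\infty \leq 2\sum_{j=r+1}^{n}\sigma_j, \tag{9.27a}$$

$$\text{lower bound:}\quad \|G - G_r\|_\infty \geq \sigma_{r+1}, \tag{9.27b}$$

where $\sigma_j$ is the $j$th diagonal entry of the balanced Gramians. The diagonal
entries of $\Sigma$ are also known as Hankel singular values.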
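
The truncation (9.26) and the bounds (9.27) can be illustrated on the two-state system from Code 9.1. The sketch below makes several assumptions that are not part of the text above: it swaps in the standard square-root construction of $\Psi$ and $\Phi$ (Cholesky factors of the Gramians followed by an SVD) in place of the eigendecomposition of $W_cW_o$, it estimates the infinity-norm by sampling the transfer functions on a finite frequency grid, and the function names are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def balanced_truncation(A, B, C, r):
    """Order-r balanced truncation via a square-root construction (assumed variant)."""
    Wc = solve_continuous_lyapunov(A, -B @ B.T)
    Wo = solve_continuous_lyapunov(A.T, -C.T @ C)
    Lc = np.linalg.cholesky(Wc)            # Wc = Lc Lc^T
    Lo = np.linalg.cholesky(Wo)            # Wo = Lo Lo^T
    U, s, Vt = np.linalg.svd(Lo.T @ Lc)    # s holds the Hankel singular values
    Psi = Lc @ Vt[:r].T / np.sqrt(s[:r])   # first r columns of T
    Phi = Lo @ U[:, :r] / np.sqrt(s[:r])   # Phi^T gives the first r rows of T^-1
    return Phi.T @ A @ Psi, Phi.T @ B, C @ Psi, s

def hinf_error_on_grid(A, B, C, Ar, Br, Cr, omegas):
    """Largest singular value of G(iw) - Gr(iw) over a frequency grid (an estimate)."""
    n, r = A.shape[0], Ar.shape[0]
    err = 0.0
    for w in omegas:
        G = C @ np.linalg.solve(1j * w * np.eye(n) - A, B)
        Gr = Cr @ np.linalg.solve(1j * w * np.eye(r) - Ar, Br)
        err = max(err, np.linalg.svd(G - Gr, compute_uv=False)[0])
    return err

# Two-state system from Code 9.1, truncated to a single state
A = np.array([[-0.75, 1.0], [-0.3, -0.75]])
B = np.array([[2.0], [1.0]])
C = np.array([[1.0, 2.0]])

Ar, Br, Cr, hsv = balanced_truncation(A, B, C, r=1)
err = hinf_error_on_grid(A, B, C, Ar, Br, Cr, np.logspace(-3, 3, 2000))
print("Hankel singular values:", hsv)
print("lower bound:", hsv[1], " grid estimate:", err, " upper bound:", 2 * hsv[1])
```

Because the frequency grid is finite, the printed error is only a lower estimate of the true infinity-norm; a dedicated norm routine would be needed for a certified value.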


Computing Balanced Realizations 


In the previous section we demonstrated the feasibility of obtaining a coordi- 
nate transformation that balances the controllability and observability Grami- 
ans. However, the computation of this balancing transformation is non-trivial, 
and significant work has gone into obtaining accurate and efficient methods, 
starting with Moore in 1981 [509], and continuing with Lall, Marsden, and 
Glavaški in 2002 [426], Willcox and Peraire in 2002 [755], and Rowley in 2005
[608]. For an excellent and complete treatment of balanced realizations and
model reduction, see Antoulas [24].

In practice, computing the Gramians $W_c$ and $W_o$ and the eigendecomposition
of the product $W_cW_o$ in (9.16) may be prohibitively expensive for
high-dimensional systems. Instead, the balancing transformation may be
approximated from impulse-response data, utilizing the singular value
decomposition for efficient extraction of the most relevant subspaces.

We will first show that Gramians may be approximated via a snapshot matrix
from impulse-response experiments/simulations. Then, we will show how
the balancing transformation may be obtained from this data.


Empirical Gramians 


In practice, computing Gramians via the Lyapunov equation is computationally
expensive, with computational complexity of $O(n^3)$. Instead, the Gramians
may be approximated by full-state measurements of the discrete-time direct
and adjoint systems:

$$\text{direct:}\quad x_{k+1} = A_dx_k + B_du_k, \tag{9.28a}$$

$$\text{adjoint:}\quad x_{k+1} = A_d^{*}x_k + C_d^{*}y_k. \tag{9.28b}$$

Equation (9.28a) is the discrete-time dynamic update equation from (8.21), and
(9.28b) is the adjoint equation. The matrices $A_d$, $B_d$, and $C_d$ are the
discrete-time system matrices from (8.22). Note that the adjoint equation is
generally non-physical, and must be simulated; thus the methods here apply to
analytical equations and simulations, but not to experimental data. An alternative
formulation that does not rely on adjoint data, and therefore generalizes to
experiments, will be provided in Section 9.3.

Computing the impulse response of the direct and adjoint systems yields
the following discrete-time snapshot matrices:

$$\mathcal{C}_d = \begin{bmatrix} B_d & A_dB_d & \cdots & A_d^{m_c-1}B_d \end{bmatrix}, \qquad \mathcal{O}_d = \begin{bmatrix} C_d \\ C_dA_d \\ \vdots \\ C_dA_d^{m_o-1} \end{bmatrix}. \tag{9.29}$$

Note that when $m_c = n$, $\mathcal{C}_d$ is the discrete-time controllability
matrix, and when $m_o = n$, $\mathcal{O}_d$ is the discrete-time observability
matrix; however, we generally consider $m_c, m_o \ll n$. These matrices may also
be obtained by sampling the continuous-time direct and adjoint systems at a
regular interval $\Delta t$.

It is now possible to compute empirical Gramians that approximate the true
Gramians without solving the Lyapunov equations in (8.42) and (8.43):

$$W_c \approx W_c^e = \mathcal{C}_d\mathcal{C}_d^{*}, \tag{9.30a}$$

$$W_o \approx W_o^e = \mathcal{O}_d^{*}\mathcal{O}_d. \tag{9.30b}$$

The empirical Gramians essentially comprise a Riemann sum approximation
of the integral in the continuous-time Gramians, which becomes exact as the
time-step of the discrete-time system becomes arbitrarily small and the dura- 
tion of the impulse response becomes arbitrarily large. In practice, the impulse- 
response snapshots should be collected until the lightly damped transients die 
out. The method of empirical Gramians is quite efficient, and is widely used 
[755]. Note that p adjoint impulse responses are required, 
where p is the number of outputs. This becomes intractable when there are a 
large number of outputs (e.g., full-state measurements), motivating the output 
projection below. 


Balanced POD 


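Instead of computing the eigendecomposition of $W_cW_o$, which is an $n \times n$
matrix, it is possible to compute the balancing transformation via the singular
value decomposition of the product of the snapshot matrices,

$$\mathcal{O}_d\mathcal{C}_d, \tag{9.31}$$

reminiscent of the method of snapshots from Section 1.3 [663]. This is the
approach taken by Rowley [608].

First, define the generalized Hankel matrix as the product of the adjoint
($\mathcal{O}_d$) and direct ($\mathcal{C}_d$) snapshot matrices from (9.29), for
the discrete-time system:

\begin{align}
H = \mathcal{O}_d\mathcal{C}_d &= \begin{bmatrix} C_d \\ C_dA_d \\ \vdots \\ C_dA_d^{m_o-1} \end{bmatrix}\begin{bmatrix} B_d & A_dB_d & \cdots & A_d^{m_c-1}B_d \end{bmatrix} \tag{9.32a} \\
&= \begin{bmatrix} C_dB_d & C_dA_dB_d & \cdots & C_dA_d^{m_c-1}B_d \\ C_dA_dB_d & C_dA_d^{2}B_d & \cdots & C_dA_d^{m_c}B_d \\ \vdots & \vdots & \ddots & \vdots \\ C_dA_d^{m_o-1}B_d & C_dA_d^{m_o}B_d & \cdots & C_dA_d^{m_c+m_o-2}B_d \end{bmatrix}. \tag{9.32b}
\end{align}

Next, we factor $H$ using the SVD:

$$H = U\Sigma V^{*} = \begin{bmatrix} \tilde{U} & U_t \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & \Sigma_t \end{bmatrix}\begin{bmatrix} \tilde{V}^{*} \\ V_t^{*} \end{bmatrix} \approx \tilde{U}\tilde{\Sigma}\tilde{V}^{*}. \tag{9.33}$$

For a given desired model order $r \ll n$, only the first $r$ columns of $U$ and
$V$ are retained, along with the first $r \times r$ block of $\Sigma$; the
remaining contribution from $U_t\Sigma_tV_t^{*}$ may be truncated. This yields a
bi-orthogonal set of modes given by:

$$\text{direct modes:}\quad \Psi = \mathcal{C}_d\tilde{V}\tilde{\Sigma}^{-1/2}, \tag{9.34a}$$

$$\text{adjoint modes:}\quad \Phi = \mathcal{O}_d^{*}\tilde{U}\tilde{\Sigma}^{-1/2}. \tag{9.34b}$$

The direct modes $\Psi \in \mathbb{R}^{n \times r}$ and adjoint modes
$\Phi \in \mathbb{R}^{n \times r}$ are bi-orthogonal,
$\Phi^{*}\Psi = I_{r \times r}$, and Rowley showed that they establish the change
of coordinates that balance the truncated empirical Gramians. Thus, $\Psi$
approximates the first $r$ columns of the full $n \times n$ balancing
transformation, $T$, and $\Phi^{*}$ approximates the first $r$ rows of the
$n \times n$ inverse balancing transformation, $S = T^{-1}$.

Now, it is possible to project the original system onto these modes, yielding
a balanced reduced-order model of order $r$:

$$\tilde{A} = \Phi^{*}A_d\Psi, \tag{9.35a}$$

$$\tilde{B} = \Phi^{*}B_d, \tag{9.35b}$$

$$\tilde{C} = C_d\Psi. \tag{9.35c}$$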


It is possible to compute the reduced system dynamics in (9.35a) without having
direct access to $A_d$. In some cases, $A_d$ may be exceedingly large and
unwieldy, and instead it is only possible to evaluate the action of this matrix on
an input vector. For example, in many modern fluid dynamics codes, the matrix
$A_d$ is not actually represented, but because it is sparse, it is possible to
implement efficient routines to multiply this matrix by a vector.

It is important to note that the reduced-order model in (9.35) is formulated
in discrete time, as it is based on discrete-time empirical snapshot matrices.
However, it is simple to obtain the corresponding continuous-time system:


>>sysD = ss(Atilde,Btilde,Ctilde,D,dt); % Discrete-time
>>sysC = d2c(sysD);                     % Continuous-time


In this example, D is the same in continuous time and discrete time, and in the 
full-order and reduced-order models. 

Note that a BPOD model may not exactly satisfy the upper bound from 
balanced truncation (see (9.27)) due to errors in the empirical Gramians. 
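
The BPOD steps (9.29) and (9.32) to (9.35) can be collected into a short routine. The sketch below is illustrative only: the three-state discrete-time system is made up for the example, the function names are hypothetical, and the snapshot matrices are formed from explicit matrix products, whereas a large-scale code would obtain them from direct and adjoint impulse-response simulations.

```python
import numpy as np

def snapshots(Ad, Bd, m):
    """Return [Bd, Ad Bd, ..., Ad^(m-1) Bd] stacked column-wise."""
    cols, X = [], Bd
    for _ in range(m):
        cols.append(X)
        X = Ad @ X
    return np.hstack(cols)

def bpod(Ad, Bd, Cd, r, mc=50, mo=50):
    """Balanced POD model of order r from direct and adjoint snapshots."""
    Cmat = snapshots(Ad, Bd, mc)            # direct snapshots (9.29)
    Omat = snapshots(Ad.T, Cd.T, mo).T      # adjoint snapshots (9.29)
    H = Omat @ Cmat                         # generalized Hankel matrix (9.32)
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Ur, sr, Vr = U[:, :r], s[:r], Vt[:r].T
    Psi = Cmat @ Vr / np.sqrt(sr)           # direct modes (9.34a)
    Phi = Omat.T @ Ur / np.sqrt(sr)         # adjoint modes (9.34b)
    return Phi.T @ Ad @ Psi, Phi.T @ Bd, Cd @ Psi, Psi, Phi

# Illustrative stable discrete-time system (an assumption for this sketch)
Ad = np.array([[0.9, 0.1, 0.0],
               [0.0, 0.7, 0.1],
               [0.0, 0.0, 0.5]])
Bd = np.ones((3, 1))
Cd = np.ones((1, 3))

# Order-2 model: the modes should be bi-orthogonal
A2, B2, C2, Psi2, Phi2 = bpod(Ad, Bd, Cd, r=2)
print(Phi2.T @ Psi2)

# Order-3 model: no truncation, so the impulse response should be reproduced
A3, B3, C3, Psi3, Phi3 = bpod(Ad, Bd, Cd, r=3)
full = [(Cd @ np.linalg.matrix_power(Ad, k) @ Bd).item() for k in range(10)]
red = [(C3 @ np.linalg.matrix_power(A3, k) @ B3).item() for k in range(10)]
print(np.max(np.abs(np.array(full) - np.array(red))))
```

Only the products of $A_d$ with the snapshot matrices are needed here, which is what allows the method to be applied when $A_d$ is available only through its action on a vector.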


Output Projection 


Often, in high-dimensional simulations, we assume full-state measurements, so 
that p = n is exceedingly large. To avoid computing p = n adjoint simulations, 
it is possible instead to solve an output-projected adjoint equation [608]: 


$$x_{k+1} = A_d^{*}x_k + C_d^{*}\tilde{U}y, \tag{9.36}$$

where $\tilde{U}$ is a matrix containing the first $r$ singular vectors of
$\mathcal{C}_d$. Thus, we first identify a low-dimensional POD subspace
$\tilde{U}$ from a direct impulse response, and then only perform adjoint
impulse-response simulations by exciting these few POD coefficient measurements.
More generally, if $y$ is high-dimensional but does not measure the full state, it
is possible to use a POD subspace trained on the measurements, given by the first
$r$ singular vectors $\tilde{U}$ of $C_d\mathcal{C}_d$. Adjoint impulse responses
may then be performed in these output POD directions.


Data Collection and Stacking


The powers $m_c$ and $m_o$ in (9.32) signify that data must be collected until the
matrices $\mathcal{C}_d$ and $\mathcal{O}_d^{*}$ are full rank, after which the
controllable/observable subspaces have been sampled. Unless we collect data until
transients decay, the
true Gramians are only approximately balanced. Instead, it is possible to col- 
lect data until the Hankel matrix is full rank, balance the resulting model, and 
then truncate. This more efficient approach is developed in and [462]. 

The snapshot matrices in (9.29) are generated from impulse-response
simulations of the direct and adjoint systems. These time-series
snapshots are then interleaved to form the snapshot matrices.


Historical Note 


The balanced POD method described above originated with the seminal work 
of Moore in 1981 [509], which provided a data-driven generalization of the min- 
imal realization theory of Ho and Kalman [329]. Until then, minimal realiza- 
tions were defined in terms of idealized controllable and observable subspaces, 
which neglected the subtlety of degrees of controllability and observability. 

Moore’s paper introduced a number of critical concepts that bridged the gap
from theory to reality. First, he established a connection between principal com- 
ponent analysis (PCA) and Gramians, showing that information about degrees 
of controllability and observability may be mined from data via the SVD. Next, 
Moore showed that a balancing transformation exists that makes the Grami- 
ans equal, diagonal, and hierarchically ordered by balanced controllability and 
observability; moreover, he provides an algorithm to compute this transfor- 
mation. This set the stage for principled model reduction, whereby states may 
be truncated based on their joint controllability and observability. Moore fur- 
ther introduced the notion of an empirical Gramian, although he did not use 
this terminology. He also realized that computing $W_c$ and $W_o$ directly is less
accurate than computing the SVD of the empirical snapshot matrices from the
direct and adjoint systems, and he avoided directly computing the
eigendecomposition of $W_cW_o$ by using these SVD transformations. In 2002, Lall,
Marsden, and Glavaški [426] generalized this theory to nonlinear systems.

One drawback of Moore’s approach is that he computed the entire $n \times n$
balancing transformation, which is not suitable for exceedingly high-dimensional
systems. In 2002, Willcox and Peraire [755] generalized the method to
high-dimensional systems, introducing a variant based on the rank-$r$
decompositions of $W_c$ and $W_o$ obtained from the direct and adjoint snapshot
matrices. It is then possible to compute the eigendecomposition of $W_cW_o$ using
efficient eigenvalue solvers without ever actually writing down the full
$n \times n$ matrices. However, this approach has the drawback of requiring as
many adjoint impulse-response simulations as the number of output equations,
which may be exceedingly large for full-state measurements. In 2005, Rowley [608]
addressed this issue by introducing the output projection, discussed above, which
limits the number of adjoint simulations to the number of relevant POD modes
in the data. He also showed that it is possible to use the eigendecomposition of
the product $\mathcal{O}_d\mathcal{C}_d$. The product $\mathcal{O}_d\mathcal{C}_d$
is often smaller, and these computations may be more accurate.

It is interesting to note that a nearly equivalent formulation was developed 
20 years earlier in the field of system identification. The so-called eigensystem 
realization algorithm (ERA) [358], introduced in 1985 by Juang and Pappa, ob- 
tains equivalent balanced models without the need for adjoint data, making it 
useful for system identification in experiments. This connection between ERA 
and BPOD was established by Ma et al. in 2011 [468]. 


Balanced Model Reduction Example 


In this example we will demonstrate the computation of balanced truncation
and balanced POD models on a random state-space system with n = 100 states,
q = 2 inputs, and p = 2 outputs. First, we generate a system:

q = 2;   % Number of inputs
p = 2;   % Number of outputs
n = 100; % State dimension
sysFull = drss(n,p,q); % Discrete random system


Next, we compute the Hankel singular values, which are plotted in Fig. 9.3. We
see that r = 10 modes capture over 90% of the input-output energy.

hsvs = hsvd(sysFull); % Hankel singular values

Now we construct an exact balanced truncation model with order r = 10:

%% Exact balanced truncation
r = 10; % Reduced model order
sysBT = balred(sysFull,r); % Balanced truncation


The full-order system, and the balanced truncation and balanced POD models
are compared in Fig. 9.4. The code used to generate a BPOD model is available
in MATLAB and Python on the book’s GitHub. It can be seen that the balanced
model accurately captures the dominant input-output dynamics, even
when only 10% of the modes are kept.


9.3 System Identification 


In contrast to model reduction, where the system model (A, B, C, D) was known,
system identification is purely data-driven. System identification may be thought
of as a form of machine learning, where an input-output map of a system is
learned from training data in a representation that generalizes to data that was
not in the training set. There is a vast literature on methods for system
identification [451], and many of the leading methods are based on a form of
dynamic regression that fits models based on data, such as the DMD from
Section 7.2. For this section, we consider the eigensystem realization algorithm
(ERA) and observer Kalman filter identification (OKID) methods because of
their connection to balanced model reduction and their successful application
in high-dimensional systems such as vibration control of aerospace structures
and closed-loop flow control [38, 39, 345]. The ERA/OKID procedure is also
applicable to multiple-input, multiple-output (MIMO) systems. Other methods
include the autoregressive moving average (ARMA) and autoregressive moving
average with exogenous inputs (ARMAX) models [751], the nonlinear
autoregressive moving average with exogenous inputs (NARMAX) model, and the
SINDy method from Section 7.3.

Figure 9.3: Hankel singular values (left) and cumulative sum (right) for random
state-space system with n = 100 and p = q = 2. The first r = 10 Hankel singular
values contain 92.9% of the cumulative sum.

Figure 9.4: Impulse response of full-state model with n = 100 and p = q = 2,
along with balanced truncation and balanced POD models with r = 10.


Eigensystem Realization Algorithm 


The eigensystem realization algorithm (ERA) produces low-dimensional linear
input-output models from sensor measurements of an impulse-response
experiment, based on the "minimal realization" theory of Ho and Kalman [829].
The modern theory was developed to identify structural models for various
spacecraft [358], and it has been shown by Ma et al. that ERA models
are equivalent to BPOD models.³ However, ERA is based entirely on
impulse-response measurements and does not require prior knowledge of a model.
We consider a discrete-time system, as described in Section 8.2:

$$x_{k+1} = A_d x_k + B_d u_k, \tag{9.37a}$$
$$y_k = C_d x_k + D_d u_k. \tag{9.37b}$$


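A discrete-time delta function input in the actuation $u$,

$$u_k^\delta \triangleq u^\delta(k\Delta t) = \begin{cases} I, & k = 0, \\ 0, & k = 1, 2, 3, \ldots, \end{cases} \tag{9.38}$$

gives rise to a discrete-time impulse response in the sensors $y$:

$$y_k^\delta \triangleq y^\delta(k\Delta t) = \begin{cases} D_d, & k = 0, \\ C_d A_d^{k-1} B_d, & k = 1, 2, 3, \ldots. \end{cases} \tag{9.39}$$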


In an experiment or simulation, typically q impulse responses are performed,
one for each of the q separate input channels. The output responses are collected
for each impulsive input, and, at a given time-step k, the output vector in
response to the jth impulsive input will form the jth column of $y_k^\delta$.
Thus, each $y_k^\delta$ is a $p \times q$ matrix, equal to $C_d A_d^{k-1} B_d$
for $k \geq 1$. Note that the system matrices $(A_d, B_d, C_d, D_d)$ do
not actually need to exist, as the method below is purely data-driven.

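The Hankel matrix H is formed by stacking shifted time series of
impulse-response measurements into a matrix, as in the HAVOK method from
Section 7.5:

$$H = \begin{bmatrix} y_1^\delta & y_2^\delta & \cdots & y_{m_c}^\delta \\ y_2^\delta & y_3^\delta & \cdots & y_{m_c+1}^\delta \\ \vdots & \vdots & \ddots & \vdots \\ y_{m_o}^\delta & y_{m_o+1}^\delta & \cdots & y_{m_c+m_o-1}^\delta \end{bmatrix} \tag{9.40a}$$

$$= \begin{bmatrix} C_d B_d & C_d A_d B_d & \cdots & C_d A_d^{m_c-1} B_d \\ C_d A_d B_d & C_d A_d^2 B_d & \cdots & C_d A_d^{m_c} B_d \\ \vdots & \vdots & \ddots & \vdots \\ C_d A_d^{m_o-1} B_d & C_d A_d^{m_o} B_d & \cdots & C_d A_d^{m_c+m_o-2} B_d \end{bmatrix} = \mathcal{O}_d \mathcal{C}_d. \tag{9.40b}$$

³ BPOD and ERA models both balance the empirical Gramians and approximate balanced
truncation [509] for high-dimensional systems, given a sufficient volume of data.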


The matrix H may be constructed purely from measurements $y^\delta$, without
separately constructing $\mathcal{O}_d$ and $\mathcal{C}_d$. Thus, we do not need
access to adjoint equations.
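As a minimal sketch of this construction, the snippet below simulates one impulse per input channel and stacks the measurements into a block Hankel matrix. The small random system, its dimensions, and the helper names `impulse_response` and `block_hankel` are illustrative assumptions, not part of the book's code; the model-based factorization is computed only to check the data-driven matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 6, 2, 2          # state, output, and input dimensions (illustrative)
mo, mc = 4, 5              # number of block rows and block columns of H

# Hypothetical stable discrete-time system, used only to generate data
A = rng.standard_normal((n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # rescale spectral radius to 0.9
B = rng.standard_normal((n, q))
C = rng.standard_normal((p, n))
D = rng.standard_normal((p, q))

def impulse_response(A, B, C, D, m):
    """Apply one impulse per input channel; Y[k] is the p x q matrix y_k^delta."""
    q = B.shape[1]
    Y = np.zeros((m, C.shape[0], q))
    for j in range(q):
        x = np.zeros(A.shape[0])
        for k in range(m):
            u = np.zeros(q)
            if k == 0:
                u[j] = 1.0                      # delta input on channel j
            Y[k, :, j] = C @ x + D @ u
            x = A @ x + B @ u
    return Y

def block_hankel(Y, mo, mc, shift=0):
    """Block (i, j) is y_{i+j+1+shift}^delta; shift=1 gives the shifted matrix H'."""
    return np.block([[Y[i + j + 1 + shift] for j in range(mc)] for i in range(mo)])

Y = impulse_response(A, B, C, D, mo + mc + 1)
H = block_hankel(Y, mo, mc)

# Model-based check: H should equal the product of observability and
# controllability matrices, which the data-driven route never forms.
Obs = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(mo)])
Ctrb = np.hstack([np.linalg.matrix_power(A, j) @ B for j in range(mc)])
hankel_error = np.linalg.norm(H - Obs @ Ctrb)
```

In an experiment only `Y` would be available; `Obs` and `Ctrb` require the model and appear here purely as a consistency check.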

Taking the SVD of the Hankel matrix yields the dominant temporal patterns 
in the time-series data: 


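$$H = U \Sigma V^* = \begin{bmatrix} \tilde{U} & U_t \end{bmatrix} \begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & \Sigma_t \end{bmatrix} \begin{bmatrix} \tilde{V}^* \\ V_t^* \end{bmatrix} \approx \tilde{U} \tilde{\Sigma} \tilde{V}^*. \tag{9.41}$$

The small singular values in $\Sigma_t$ are truncated, and only the first r singular
values in $\tilde{\Sigma}$ are retained. The columns of $\tilde{U}$ and $\tilde{V}$ are
eigen-time-delay coordinates.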

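Until this point, the ERA algorithm closely resembles the BPOD procedure
described earlier in this chapter. However, we do not require direct access to
$\mathcal{O}_d$ and $\mathcal{C}_d$ or the system (A, B, C, D) to construct the
direct and adjoint balancing transformations. Instead, with sensor measurements
from an impulse-response experiment, it is also possible to create a second,
shifted Hankel matrix H':

$$H' = \begin{bmatrix} y_2^\delta & y_3^\delta & \cdots & y_{m_c+1}^\delta \\ y_3^\delta & y_4^\delta & \cdots & y_{m_c+2}^\delta \\ \vdots & \vdots & \ddots & \vdots \\ y_{m_o+1}^\delta & y_{m_o+2}^\delta & \cdots & y_{m_c+m_o}^\delta \end{bmatrix} \tag{9.42a}$$

$$= \begin{bmatrix} C_d A_d B_d & C_d A_d^2 B_d & \cdots & C_d A_d^{m_c} B_d \\ C_d A_d^2 B_d & C_d A_d^3 B_d & \cdots & C_d A_d^{m_c+1} B_d \\ \vdots & \vdots & \ddots & \vdots \\ C_d A_d^{m_o} B_d & C_d A_d^{m_o+1} B_d & \cdots & C_d A_d^{m_c+m_o-1} B_d \end{bmatrix} = \mathcal{O}_d A_d \mathcal{C}_d. \tag{9.42b}$$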

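Based on the matrices H and H', we are able to construct a reduced-order
model as follows:

$$\tilde{A} = \tilde{\Sigma}^{-1/2} \tilde{U}^* H' \tilde{V} \tilde{\Sigma}^{-1/2}, \tag{9.43a}$$
$$\tilde{B} = \tilde{\Sigma}^{1/2} \tilde{V}^* \begin{bmatrix} I_q \\ 0 \end{bmatrix}, \tag{9.43b}$$
$$\tilde{C} = \begin{bmatrix} I_p & 0 \end{bmatrix} \tilde{U} \tilde{\Sigma}^{1/2}. \tag{9.43c}$$

Here $I_q$ is the $q \times q$ identity matrix, which extracts the first q columns, and
$I_p$ is the $p \times p$ identity matrix, which extracts the first p rows. Alternatively,
$\tilde{B}$ and $\tilde{C}$ may be computed using the fact that
$H \approx \tilde{U} \tilde{\Sigma} \tilde{V}^*$ as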


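$$\tilde{B} = \tilde{\Sigma}^{-1/2} \tilde{U}^* H \begin{bmatrix} I_q \\ 0 \end{bmatrix}, \tag{9.44a}$$
$$\tilde{C} = \begin{bmatrix} I_p & 0 \end{bmatrix} H \tilde{V} \tilde{\Sigma}^{-1/2}. \tag{9.44b}$$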
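The alternative expressions follow from the truncated SVD. A short derivation, assuming only that the retained singular vectors have orthonormal columns and that the truncated factorization of H is accurate:

```latex
\begin{align*}
H \approx \tilde{U}\tilde{\Sigma}\tilde{V}^*
  &\;\Longrightarrow\;
  \tilde{V}^* \approx \tilde{\Sigma}^{-1}\tilde{U}^* H
  \quad\text{and}\quad
  \tilde{U} \approx H\tilde{V}\tilde{\Sigma}^{-1}, \\
\tilde{B} = \tilde{\Sigma}^{1/2}\tilde{V}^*
  \begin{bmatrix} I_q \\ 0 \end{bmatrix}
  &\approx \tilde{\Sigma}^{-1/2}\tilde{U}^* H
  \begin{bmatrix} I_q \\ 0 \end{bmatrix}, \\
\tilde{C} = \begin{bmatrix} I_p & 0 \end{bmatrix}\tilde{U}\tilde{\Sigma}^{1/2}
  &\approx \begin{bmatrix} I_p & 0 \end{bmatrix} H\tilde{V}\tilde{\Sigma}^{-1/2}.
\end{align*}
```

In words, the reduced input matrix uses the first q columns of H and the reduced output matrix uses the first p rows of H, which is the form used in the code listings.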


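Thus, we express the input-output dynamics in terms of a reduced system with
a low-dimensional state $\tilde{x} \in \mathbb{R}^r$:

$$\tilde{x}_{k+1} = \tilde{A} \tilde{x}_k + \tilde{B} u_k, \tag{9.45a}$$
$$y_k = \tilde{C} \tilde{x}_k. \tag{9.45b}$$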
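The full procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the book's listing: the function name `era`, the list-of-matrices data layout, and the small test system are all hypothetical. With noise-free data and r equal to the true state dimension, the identified model should reproduce the Markov parameters to machine precision.

```python
import numpy as np

def era(Y, mo, mc, r):
    """ERA sketch. Y[k] is the p x q Markov parameter y_k^delta, k = 0..mo+mc."""
    p, q = Y[0].shape
    H = np.block([[Y[i + j + 1] for j in range(mc)] for i in range(mo)])
    H2 = np.block([[Y[i + j + 2] for j in range(mc)] for i in range(mo)])  # shifted H'
    U, s, Vh = np.linalg.svd(H, full_matrices=False)
    Ur, Vr = U[:, :r], Vh[:r, :].conj().T
    sq = np.sqrt(s[:r])                                  # diagonal of Sigma^{1/2}
    # Sigma^{-1/2} U* H' V Sigma^{-1/2}
    Ar = (Ur.conj().T @ H2 @ Vr) / sq[:, None] / sq[None, :]
    Br = (sq[:, None] * Vr.conj().T)[:, :q]              # Sigma^{1/2} V* [I_q; 0]
    Cr = (Ur * sq[None, :])[:p, :]                       # [I_p 0] U Sigma^{1/2}
    Dr = Y[0]                                            # feedthrough is y_0^delta
    return Ar, Br, Cr, Dr, s

# Hypothetical test system that generates the impulse-response data
rng = np.random.default_rng(1)
n, p, q, mo, mc = 4, 2, 2, 10, 10
A = rng.standard_normal((n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.standard_normal((n, q))
C = rng.standard_normal((p, n))
D = rng.standard_normal((p, q))
Y = [D] + [C @ np.linalg.matrix_power(A, k - 1) @ B for k in range(1, mo + mc + 2)]

Ar, Br, Cr, Dr, hsv = era(Y, mo, mc, r=n)

# Compare Markov parameters of the identified model with the data
Y_era = [Dr] + [Cr @ np.linalg.matrix_power(Ar, k - 1) @ Br for k in range(1, 16)]
era_error = max(np.linalg.norm(Y_era[k] - Y[k]) for k in range(16))
```

Choosing r smaller than n gives a truncated model; the decay of `hsv` (the Hankel singular values) indicates how much is lost.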


The Hankel matrices H and H' are constructed from impulse-response
simulations or experiments, without the need for storing direct or adjoint
snapshots, as in other balanced model reduction techniques. However, if
full-state snapshots are available, for example, by collecting velocity fields
in simulations or particle image velocimetry (PIV) experiments, it is then
possible to construct direct modes. These full-state snapshots form
$\mathcal{C}_d$, and modes can be constructed by

$$\tilde{\Phi} = \mathcal{C}_d \tilde{V} \tilde{\Sigma}^{-1/2}. \tag{9.46}$$

These modes may then be used to approximate the full state of the high-dimensional
system from the low-dimensional model in (9.45) by

$$x \approx \tilde{\Phi} \tilde{x}. \tag{9.47}$$


If enough data is collected when constructing the Hankel matrix H, then
ERA balances the empirical controllability and observability Gramians,
$\mathcal{C}_d \mathcal{C}_d^*$ and $\mathcal{O}_d^* \mathcal{O}_d$. However, if less
data is collected, so that lightly damped transients do not have time to decay,
then ERA will only approximately balance the system. It is instead possible to
collect just enough data so that the Hankel matrix H reaches numerical full rank
(i.e., so that remaining singular values are below a threshold tolerance), and
compute an ERA model. The resulting ERA model will typically have a relatively
low order, given by the numerical rank of the controllability and observability
subspaces. It may then be possible to apply exact balanced truncation to this
smaller model, as is advocated in [462].

The code to compute ERA is provided in Code 9.2 below. Large portions of
the code that format the input data into a Hankel matrix are omitted, but can
be found on the book's GitHub.


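Code 9.2: [MATLAB] Eigensystem realization algorithm.

    function [Ar,Br,Cr,Dr,HSVs] = ERA(YY,m,n,nin,nout,r)
    % Code to format data into Hankel matrices H and H2 (and to set Dr) omitted
    [U,S,V] = svd(H,'econ');
    Sigma = S(1:r,1:r);                      % leading r singular values
    Ur = U(:,1:r);
    Vr = V(:,1:r);
    Ar = Sigma^(-.5)*Ur'*H2*Vr*Sigma^(-.5);  % (9.43a), with H2 the shifted Hankel matrix
    Br = Sigma^(-.5)*Ur'*H(:,1:nin);         % (9.44a)
    Cr = H(1:nout,:)*Vr*Sigma^(-.5);         % (9.44b)
    HSVs = diag(S);                          % Hankel singular values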


Code 9.2: [Python] Eigensystem realization algorithm.

    import numpy as np
    from scipy.linalg import fractional_matrix_power

    def ERA(YY,m,n,nin,nout,r):
        # Code to format data into Hankel matrices H and H2 (and to set Dr) omitted
        U,S,VT = np.linalg.svd(H,full_matrices=0)
        V = VT.T
        Sigma = np.diag(S[:r])       # leading r singular values
        Ur = U[:,:r]
        Vr = V[:,:r]
        Ar = fractional_matrix_power(Sigma,-0.5) @ Ur.T @ H2 @ \
             Vr @ fractional_matrix_power(Sigma,-0.5)
        Br = fractional_matrix_power(Sigma,-0.5) @ Ur.T @ H[:,:nin]
        Cr = H[:nout,:] @ Vr @ fractional_matrix_power(Sigma,-0.5)
        HSVs = S                     # Hankel singular values
        return Ar,Br,Cr,Dr,HSVs

Figure 9.5: Schematic overview of OKID procedure. The output of OKID is an
impulse response that can be used for system identification via ERA.


Observer Kalman Filter Identification 


OKID, illustrated in Fig. 9.5, was developed to complement the ERA for lightly
damped experimental systems with noise [359]. In practice, performing iso- 
lated impulse-response experiments is challenging, and the effect of measure- 
ment noise can contaminate results. Moreover, if there is a large separation of 
timescales, then a tremendous amount of data must be collected to use ERA. 
This section poses the general problem of approximating the impulse response 
from arbitrary input-output data. Typically, one would identify reduced-order 
models according to the following general procedure: 


1. Collect the output in response to a pseudo-random input. 


2. This information is passed through the OKID algorithm to obtain the de- 
noised linear impulse response. 


3. The impulse response is passed through the ERA to obtain a reduced- 
order state-space system. 


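The output $y_k$ in response to a general input signal $u_k$, for zero initial
condition $x_0 = 0$, is given by

$$y_0 = D_d u_0, \tag{9.48a}$$
$$y_1 = C_d B_d u_0 + D_d u_1, \tag{9.48b}$$
$$y_2 = C_d A_d B_d u_0 + C_d B_d u_1 + D_d u_2, \tag{9.48c}$$
$$\vdots$$
$$y_k = C_d A_d^{k-1} B_d u_0 + C_d A_d^{k-2} B_d u_1 + \cdots + C_d B_d u_{k-1} + D_d u_k. \tag{9.48d}$$

Note that there is no $C_d$ term in the expression for $y_0$ since there is zero initial
condition $x_0 = 0$. This progression of measurements $y_k$ may be further
simplified and expressed in terms of impulse-response measurements $y_k^\delta$:

$$\underbrace{\begin{bmatrix} y_0 & y_1 & \cdots & y_m \end{bmatrix}}_{S} = \underbrace{\begin{bmatrix} y_0^\delta & y_1^\delta & \cdots & y_m^\delta \end{bmatrix}}_{S^\delta} \underbrace{\begin{bmatrix} u_0 & u_1 & \cdots & u_m \\ 0 & u_0 & \cdots & u_{m-1} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_0 \end{bmatrix}}_{\mathcal{B}}. \tag{9.49}$$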

It is often possible to invert the matrix of control inputs, $\mathcal{B}$, to solve for
the Markov parameters $S^\delta$. However, $\mathcal{B}$ either may be non-invertible, or
inversion may be ill conditioned. In addition, $\mathcal{B}$ is large for lightly damped
systems, making inversion computationally expensive. Finally, noise is not
optimally filtered by simply inverting $\mathcal{B}$ to solve for the Markov parameters.
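The relation in (9.49) can be checked numerically. The sketch below assumes a small single-input, single-output system (so that the input matrix is square) with hypothetical parameters; it verifies that the simulated outputs equal the Markov parameters times the upper-triangular Toeplitz matrix of inputs, and reports the condition number that makes direct inversion unattractive.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q, m = 3, 1, 1, 30       # illustrative sizes; q = 1 makes the input matrix square

A = rng.standard_normal((n, n))
A *= 0.5 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.standard_normal((n, q))
C = rng.standard_normal((p, n))
D = rng.standard_normal((p, q))

# Simulate y_0, ..., y_m for a random input with zero initial condition
u = rng.standard_normal((q, m + 1))
x = np.zeros(n)
S = np.zeros((p, m + 1))
for k in range(m + 1):
    S[:, k] = C @ x + D @ u[:, k]
    x = A @ x + B @ u[:, k]

# Markov parameters placed side by side: [y_0^delta, y_1^delta, ..., y_m^delta]
markov = [D] + [C @ np.linalg.matrix_power(A, k - 1) @ B for k in range(1, m + 1)]
S_delta = np.hstack(markov)

# Block upper-triangular Toeplitz matrix of inputs: block (i, k) is u_{k-i} for k >= i
Bcal = np.zeros(((m + 1) * q, m + 1))
for i in range(m + 1):
    for k in range(i, m + 1):
        Bcal[i * q:(i + 1) * q, k] = u[:, k - i]

conv_error = np.linalg.norm(S - S_delta @ Bcal)   # should be at round-off level
cond_B = np.linalg.cond(Bcal)                     # grows quickly with m for random inputs
```

Printing `cond_B` for increasing m gives a feel for why the direct solve is fragile even before measurement noise is added.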

The OKID method addresses each of these issues. Instead of the original 
discrete-time system, we now introduce an optimal observer system: 


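$$\hat{x}_{k+1} = A_d \hat{x}_k + K_f (y_k - \hat{y}_k) + B_d u_k, \tag{9.50a}$$
$$\hat{y}_k = C_d \hat{x}_k + D_d u_k, \tag{9.50b}$$

which may be rewritten as

$$\hat{x}_{k+1} = \underbrace{(A_d - K_f C_d)}_{\bar{A}_d} \hat{x}_k + \underbrace{\begin{bmatrix} B_d - K_f D_d, & K_f \end{bmatrix}}_{\bar{B}_d} \begin{bmatrix} u_k \\ y_k \end{bmatrix}. \tag{9.51}$$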


Recall from above that, if the system is observable, it is possible to place the
poles of $A_d - K_f C_d$ anywhere we like. However, depending on the amount of
noise in the measurements, the magnitude of process noise, and the uncertainty
in our model, there are optimal pole locations that are given by the Kalman filter
(recall Section 8.5). We may now solve for the observer Markov parameters
$\bar{S}^\delta$ of the system in (9.51) in terms of measured inputs and outputs
according to the following algorithm from [359]:

1. Choose the number of observer Markov parameters to identify, $l$.

2. Construct the data matrices below:

$$S = \begin{bmatrix} y_0 & y_1 & \cdots & y_l & \cdots & y_m \end{bmatrix}, \tag{9.52}$$

$$\mathcal{V} = \begin{bmatrix} u_0 & u_1 & \cdots & u_l & \cdots & u_m \\ 0 & v_0 & \cdots & v_{l-1} & \cdots & v_{m-1} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_0 & \cdots & v_{m-l} \end{bmatrix}, \tag{9.53}$$

   where $v_i = \begin{bmatrix} u_i^T & y_i^T \end{bmatrix}^T$.

   The matrix $\mathcal{V}$ resembles $\mathcal{B}$, except that it has been augmented
   with the outputs $y_i$. In this way, we are working with a system that is
   augmented to include a Kalman filter. We are now identifying the observer
   Markov parameters of the augmented system, $\bar{S}^\delta$, using the equation
   $S = \bar{S}^\delta \mathcal{V}$. It will be possible to identify these observer
   Markov parameters from the data and then extract the impulse response
   (Markov parameters) of the original system.

3. Identify the matrix $\bar{S}^\delta$ of observer Markov parameters by solving
   $S = \bar{S}^\delta \mathcal{V}$ for $\bar{S}^\delta$ using the right pseudo-inverse
   of $\mathcal{V}$ (i.e., SVD).

4. Recover system Markov parameters, $S^\delta$, from the observer Markov
   parameters, $\bar{S}^\delta$:

   (a) Order the observer Markov parameters $\bar{S}^\delta$ as

$$\bar{S}_0^\delta = D, \tag{9.54}$$
$$\bar{S}_k^\delta = \begin{bmatrix} (\bar{S}^\delta)_k^{(1)} & (\bar{S}^\delta)_k^{(2)} \end{bmatrix} \quad \text{for } k \geq 1, \tag{9.55}$$

   where $(\bar{S}^\delta)_k^{(1)} \in \mathbb{R}^{p \times q}$,
   $(\bar{S}^\delta)_k^{(2)} \in \mathbb{R}^{p \times p}$, and
   $y_0^\delta = \bar{S}_0^\delta = D$.

   (b) Reconstruct system Markov parameters as

$$y_k^\delta = (\bar{S}^\delta)_k^{(1)} + \sum_{i=1}^{k} (\bar{S}^\delta)_i^{(2)} y_{k-i}^\delta \quad \text{for } k \geq 1. \tag{9.56}$$


Thus, the OKID method identifies the Markov parameters of a system aug- 
mented with an asymptotically stable Kalman filter. The system Markov pa- 
rameters are extracted from the observer Markov parameters by (9.56). These 
system Markov parameters approximate the impulse response of the system, 
and may be used directly as inputs to the ERA algorithm. Code to compute
OKID is provided in both MATLAB and Python on the book's GitHub.
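The four steps of the algorithm can be sketched compactly. This is not the book's GitHub implementation: the function name `okid`, its argument layout, and the two-state test system are assumptions made for illustration, and the data are noise-free so that the recovered Markov parameters can be compared with the exact ones.

```python
import numpy as np

def okid(y, u, l):
    """OKID sketch: estimate Markov parameters y_0^delta, ..., y_l^delta.

    y is p x (m+1) output data and u is q x (m+1) input data, recorded from
    zero initial conditions.
    """
    p, q = y.shape[0], u.shape[0]
    m1 = y.shape[1]
    v = np.vstack([u, y])                          # v_k = [u_k; y_k]
    V = np.zeros((q + (q + p) * l, m1))
    V[:q, :] = u                                   # first block row: u_0 ... u_m
    for i in range(1, l + 1):
        r0 = q + (i - 1) * (q + p)
        V[r0:r0 + q + p, i:] = v[:, :m1 - i]       # block row i: v_{k-i}, zero padded
    Sbar = y @ np.linalg.pinv(V)                   # right pseudo-inverse (via SVD)
    markov = [Sbar[:, :q]]                         # y_0^delta = D
    for k in range(1, l + 1):
        blk = Sbar[:, q + (k - 1) * (q + p): q + k * (q + p)]
        yk = blk[:, :q].copy()                     # part (1) of the k-th block
        for i in range(1, k + 1):
            blk_i = Sbar[:, q + (i - 1) * (q + p): q + i * (q + p)]
            yk += blk_i[:, q:] @ markov[k - i]     # part (2) times y_{k-i}^delta
        markov.append(yk)
    return markov

# Hypothetical two-state SISO system driven by a random input
A = np.array([[0.5, 0.2], [0.0, -0.3]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 0.5]])
D = np.array([[0.1]])

rng = np.random.default_rng(3)
m1 = 200
u = rng.standard_normal((1, m1))
y = np.zeros((1, m1))
x = np.zeros(2)
for k in range(m1):
    y[:, k] = C @ x + D @ u[:, k]
    x = A @ x + B @ u[:, k]

est = okid(y, u, l=2)
true = [D] + [C @ np.linalg.matrix_power(A, k - 1) @ B for k in range(1, 3)]
okid_error = max(np.linalg.norm(e - t) for e, t in zip(est, true))
```

Here l is chosen so that l times p equals the state dimension, which keeps the data matrix full rank for noise-free data; with noisy measurements a larger l is typically used and the pseudo-inverse acts as a least-squares filter.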

ERA/OKID has been widely applied across a range of system identification 
tasks, including to identify models of aeroelastic structures and fluid dynamic 
systems. There are numerous extensions of the ERA/OKID methods. For ex- 
ample, there are generalizations for linear parameter varying (LPV) systems 
and systems linearized about a limit cycle. 


Combining ERA and OKID 


Here we demonstrate ERA and OKID on the same model system from
Section 9.2. Because ERA yields the same balanced models as BPOD, the reduced
system responses should be the same.

First, Code 9.3 computes an impulse response of the full system, and uses
this as an input to ERA.


Code 9.3: [MATLAB] Compute impulse response and use ERA to generate
model.

    %% Obtain impulse response of full system
    [yFull,t] = impulse(sysFull,0:1:(r*5)+1);
    YY = permute(yFull,[2 3 1]);  % Reorder to be size p x q x m
                                  % (default is m x p x q)

    %% Compute ERA from impulse response
    mco = floor((length(yFull)-1)/2);  % m_c = m_o = (m-1)/2
    [Ar,Br,Cr,Dr,HSVs] = ERA(YY,mco,mco,numInputs,numOutputs,r);
    sysERA = ss(Ar,Br,Cr,Dr,-1);


Code 9.3: [Python] Compute impulse response and use ERA to generate model.

    ## Obtain impulse response of full system
    tspan = np.arange(r*5+2)                  # same time grid as the MATLAB listing
    yFull = np.zeros((tspan.size,p,q))        # preallocate m x p x q array
    for qi in range(q):
        yFull[:,:,qi],t = impulse(sysFull,T=tspan,input=qi)
    YY = np.transpose(yFull,axes=(1,2,0))     # reorder to p x q x m

    ## Compute ERA from impulse response
    mco = int(np.floor((yFull.shape[0]-1)/2))  # m_c=m_o=(m-1)/2
    Ar,Br,Cr,Dr,HSVs = ERA(YY,mco,mco,q,p,r)
    sysERA = ss(Ar,Br,Cr,Dr,1)


Next, if an impulse response is unavailable, it is possible to excite the system 
with a random input signal and use OKID to extract an impulse response. This 
impulse response is then used by ERA to extract the model, as in Code 9.4.


Code 9.4: [MATLAB] Approximate impulse response with OKID and use ERA
to generate model.

    %% Compute random input simulation for OKID
    uRandom = randn(numInputs,200);          % Random forcing input
    yRandom = lsim(sysFull,uRandom,1:200)';  % Output

    %% Compute OKID and then ERA
    H = OKID(yRandom,uRandom,r);
    mco = floor((length(H)-1)/2);  % m_c = m_o
    [Ar,Br,Cr,Dr,HSVs] = ERA(H,mco,mco,numInputs,numOutputs,r);
    sysERAOKID = ss(Ar,Br,Cr,Dr,-1);


Code 9.4: [Python] Approximate impulse response with OKID and use ERA to
generate model.

    ## Compute random input simulation for OKID
    uRandom = np.random.randn(q,200)  # Random forcing input
    yRandom = lsim(sysFull,uRandom,range(200))[0].T  # Output

    ## Compute OKID and then ERA
    H = OKID(yRandom,uRandom,r)
    mco = int(np.floor((H.shape[2]-1)/2))  # m_c = m_o
    Ar,Br,Cr,Dr,HSVs = ERA(H,mco,mco,q,p,r)
    sysERAOKID = ss(Ar,Br,Cr,Dr,1)

Figure 9.6: Input-output data used by OKID.

Figure 9.6 shows the input-output data used by OKID to approximate the
impulse response. The impulse responses of the resulting systems are
computed as in Code 9.5.


Code 9.5: [MATLAB] Plot impulse responses of various models.

    [y1,t1] = impulse(sysFull,0:1:200);
    [y2,t2] = impulse(sysERA,0:1:100);
    [y3,t3] = impulse(sysERAOKID,0:1:100);


Code 9.5: [Python] Plot impulse responses of various models.

    # Preallocate response arrays (time x outputs x inputs)
    y1 = np.zeros((200,p,q))
    y2 = np.zeros((100,p,q))
    y3 = np.zeros((100,p,q))
    for qi in range(q):
        y1[:,:,qi],t1 = impulse(sysFull,np.arange(200),input=qi)
        y2[:,:,qi],t2 = impulse(sysERA,np.arange(100),input=qi)
        y3[:,:,qi],t3 = impulse(sysERAOKID,np.arange(100),input=qi)

Finally, the system responses can be seen in Fig. 9.7. The low-order ERA and
ERA/OKID models closely match the full model and have similar performance
to the BPOD models described above. Because ERA and BPOD are
mathematically equivalent, this agreement is not surprising. However, the ability
of ERA/OKID to extract a reduced-order model from the random input data in
Fig. 9.6 is quite remarkable. Moreover, unlike BPOD, these methods are readily
applicable to experimental measurements, as they do not require non-physical
adjoint equations.

Figure 9.7: Impulse response of full-state model with n = 100 and p = q = 2,
along with ERA and ERA/OKID models with r = 10.

Homework


Exercise 9-1. Generate a random state-space system, as in the random state-space
example used earlier in this chapter. For various model orders r, compute the exact
balanced truncation model at this order. Compute the relative error between the truncated
model and the full model in at least two different system norms; there are sev- 
eral system norms, such as the 2-norm and the infinity-norm. Plot the various 
relative error norms versus the model order r. 


Now compute the relative Frobenius norm error of the truncated model im- 
pulse response compared to the impulse response of the full model; to compute 
the Frobenius norm, simply vectorize the impulse-response data and compute 
the 2-norm of the resulting vector. Plot the relative Frobenius norm error versus 
the model order r, and compare it with the system norms above.


Exercise 9-2. For the exercise above, we will now compare balanced truncation 
with output truncation, input truncation, and truncation based on the eigen- 
values of the A matrix. 


For output truncation, you will compute the truncated projection basis by com- 
puting the SVD of the observability matrix and retaining the states correspond- 
ing to the first r singular vectors; alternatively, you may use the first r leading 
eigenvectors of the observability Gramian. For input truncation, you will do 
the same thing, but for the controllability matrix (respectively, controllability 
Gramian). For truncation based on the eigenvalues of A, you will compute a 
truncated projection basis using the eigenvectors of A, with the retained states 
corresponding to the r least-damped eigenvalues (i.e., the eigenvalues with the 
most positive or least negative real part for continuous-time dynamics, or the 
eigenvalues with the largest radius for discrete-time dynamics). In this case, 
you may want to manually order the eigenvalues and eigenvectors. 


In all cases above, for a given model order r, you will use the computed basis to 
truncate the system. Reproduce Fig. 9.4 with these new forms of truncation and
discuss the results. Also create a plot of the relative error between the truncated 
models and the full model versus model order r. 


Exercise 9-3. This exercise will explore balanced residualization, which is an al- 
ternative to balanced truncation. 


(a) First, create a single-input, single-output random state-space system with 
n = 100. Now, plot the frequency response (i.e., Bode plot) of the full 
model and the balanced truncation models for various model orders r. 
Comment on how they agree and disagree. 


(b) In balanced residualization, instead of truncating the $z_t$ state in (9.25)
entirely, the equation $\dot{z}_t = 0$ is solved for $z_t$, and this is substituted
into the remaining state and output equations.

(c) Write down the balanced residualization equations based on (9.25). Use
this formulation to compute the balanced residualization model of order
r = 10 and reproduce Fig. 9.4 with this new residualized model. You can
also use the "MatchDC" option in MATLAB to check your results.

(d) Plot the frequency response of the full model and the balanced
residualization models for various model orders r. Discuss how these differ from
the balanced truncation models.


Exercise 9-4. This exercise will explore the ERA/OKID procedure for model 
identification, including how different input forcing signals affect the ability of 
ERA/OKID to identify a model. 


(a) First, reproduce the ERA/OKID results in Fig. 9.7 for model order r = 10.
Now, compute the relative Frobenius norm error between the full model
and each of the ERA and ERA/OKID models for all model orders r
between r = 1 and r = 10, as well as model orders r = 20, r = 30, r = 40, and
r = 50. To compute the Frobenius norm of the multiple-input,
multiple-output impulse response, simply vectorize the impulse-response data and
compute the 2-norm of the vector. Plot the relative error versus model order.

(b) Now, add a small amount of Gaussian white noise to the randomly forced
data that is input to the ERA/OKID procedure. How does the noise
magnitude affect the fidelity of the models of various order?

(c) Repeat the ERA/OKID results in Fig. 9.7 for model order r = 10, but
instead of Gaussian white noise forcing for u, use a quadratic chirp signal as
in (2.51). Compare the impulse response of the ERA/OKID system with
that of the full model.

(d) Repeat part (c) above, but using a pseudo-random sequence of step
functions in the control u; i.e., the control input steps to different random
values at randomly timed spacing with minimum time between changes of
5. Does the model fidelity change when the sequence of step functions
can have different amplitudes versus when they are constrained to have
the same amplitude of either 0 or 1?

(e) Repeat part (c) above using a pure tone sine wave for the control u. Now,
instead of a pure tone sine wave, use a square-wave pulse train with the
same frequency. Explain the similarities or differences in the identified
models. Compute Bode plots for each identified model and compare with
the true Bode plot.

Exercise 9-5. Download a system matrix from the SLICOT benchmark website


http://slicot.org/20-site/126-benchmark-examples-for-model-reduction


Repeat all of Exercise 9-4 above on this test system. 


Exercise 9-6. In this exercise, we will explore how ERA handles spatio-temporal 
data, such as a decaying traveling wave. 


Generate data for a traveling, decaying wave pulse 
f(a,t) =e exp"? sin(w2), 
shown below. Begin with the parameters à = —0.05, c = 1, and w = 20. Use 


a spatial domain of $x \in [-5, 15]$ with $\Delta x = 0.05$ and a temporal domain of
$t \in [0, 10]$ with $\Delta t = 0.05$. Plot this data, either as a movie or as a waterfall plot.

[Figure: the wave pulse plotted over the spatial domain from x = -5 to x = 15.]


Now, use this data to train ERA models of various orders. Explore the perfor- 
mance of these models at different orders. Explain your results. 


Exercise 9-7. Describe the connections between ERA and DMDc. How are the
algorithms connected?

Chapter 10


Data-Driven Control 


As described in Chapter 8, control design often begins with a model of the
system being controlled. Notable exceptions include model-free adaptive control
strategies, reinforcement learning, and many uses of proportional-integral-derivative
(PID) control. For mechanical systems of moderate dimension, it
may be possible to write down a model (e.g., based on the Newtonian, La- 
grangian, or Hamiltonian formalism) and linearize the dynamics about a fixed 
point or periodic orbit. However, for modern systems of interest, as are found 
in neuroscience, turbulence, epidemiology, climate, and finance, typically there 
are no simple models suitable for control design. Chapter 9 described
techniques to obtain control-oriented reduced-order models for high-dimensional
systems from data, but these approaches are limited to linear systems. Real- 
world systems are usually nonlinear and the control objective is not readily 
achieved via linear techniques. Nonlinear control can still be posed as an op- 
timization problem with a high-dimensional, non-convex cost function land- 
scape with multiple local minima. Modern data-driven methods, such as ma- 
chine learning, are complementary, as they constitute a growing set of tech- 
niques that may be broadly described as performing nonlinear optimization in 
a high-dimensional space from data. 

This chapter describes data-driven control techniques that are specifically 
designed for systems that lack a principled model. Thus, we describe emerging 
techniques that use machine learning to characterize and control strongly non- 
linear, high-dimensional, and multi-scale systems, leveraging the increasing 
availability of high-quality measurement data. Machine learning techniques 
may be used (1) to characterize a system for later use with model-based control, 
or (2) to directly characterize a control law that effectively interacts with a
system. This is illustrated schematically in Fig. 10.1, where data-driven techniques
may be applied to either the System or Controller blocks. Related methods may
also be used to identify good sensors and actuators, as discussed previously in
Section 3.8. Section 10.1 will introduce model predictive control (MPC), which
is a powerful and flexible approach for controlling nonlinear systems with
constraints and uncertainty. However, MPC relies on a system model, and so
Section 10.2 demonstrates how to use machine learning and system identification
to learn nonlinear input-output models that may be used with MPC. In
Section 10.3 we explore machine learning techniques to directly identify controllers
from input-output data. Here we explore the use of genetic algorithms to learn
control laws, as demonstrated on a simple example of tuning a PID controller.
It is important to emphasize the breadth and depth of this field, and there are
many powerful methods, including reinforcement learning, which is the
subject of Chapter 11. Finally, in Section 10.4, we describe the adaptive
extremum-seeking control strategy, which optimizes the control signal based on how the
system responds to perturbations.

Figure 10.1: In the standard control framework from Chapter 8, machine
learning may be used (1) to develop a model of the system or (2) to learn a controller.


10.1 Model Predictive Control (MPC) 


Model predictive control (MPC) has become a cornerstone of modern process
control and is ubiquitous in the industrial landscape. MPC is used to control
strongly nonlinear systems with constraints, time delays, non-minimum-phase
dynamics, and instability. Most industrial applications of MPC use empirical
models based on linear system identification (see Chapter 8), neural networks
(see Chapter 6), Volterra series [102, 118], and autoregressive models [8]
(e.g., ARX, ARMA, NARX, and NARMAX). Recently, deep learning and
reinforcement learning have been combined with MPC with impressive results.
However, deep learning requires large volumes of data and may not be readily
interpretable. A complementary line of research seeks to identify models for
MPC based on limited data to characterize systems in response to abrupt
changes. For example, Kaiser et al. recently showed that it is possible to rapidly
identify DMD and SINDy models from Chapter 7, based on limited data, and
then use these for MPC.

Figure 10.2: Schematic of model predictive control, where a surrogate model is
used to run an optimization directly inside the control loop. This diagram
assumes full-state measurements y = x for simplicity, although this is not strictly
necessary for MPC.

Model predictive control is shown schematically in Fig. 10.2. MPC determines
the next immediate control action by solving a constrained optimal control
problem over a receding horizon. In particular, the open-loop actuation
signal u is optimized on a receding time horizon $t_c = m_c \Delta t$ to minimize a cost
J over some prediction horizon $t_p = m_p \Delta t$. The control horizon is typically less
than or equal to the prediction horizon, and the control is held constant
between $t_c$ and $t_p$. The cost function is typically of a form similar to an LQR cost
function:


$$J(x_j) = \sum_{k=0}^{m_p-1} \|\hat{x}_{j+k} - r_k\|_Q^2 + \sum_{k=1}^{m_c-1} \left( \|u_{j+k}\|_R^2 + \|\Delta u_{j+k}\|_{R_\Delta}^2 \right), \tag{10.1}$$

where $r_k$ is the reference trajectory to be tracked by the MPC, $\hat{x}$ is the predicted
state, $\|x\|_Q^2 = x^T Q x$, and there is an additional penalty on large changes in
the control, i.e., on $\Delta u_k = u_k - u_{k-1}$. Note that the weight matrix $Q$ must be
positive semi-definite, and $R$ and $R_\Delta$ must be positive definite, as in LQR.
It is also possible to add a terminal cost on the final state. This cost function is
then optimized over the control sequences $\{u_{j+1}, \ldots, u_{j+k}, \ldots, u_{j+m_c}\}$ subject
to a surrogate model

$$\hat{x}_{k+1} = \hat{F}(\hat{x}_k, u_k) \tag{10.2}$$

with constraints on the inputs $u_k$,

$$u_{\min} \leq u_k \leq u_{\max}, \tag{10.3}$$

and on $\Delta u_k$,

$$\Delta u_{\min} \leq \Delta u_k \leq \Delta u_{\max}. \tag{10.4}$$

The optimal control is then applied for one time-step, and the procedure is
repeated and the receding horizon control re-optimized at each subsequent
time-step. This results in the control law

$$K(x_j) = u_{j+1}(x_j), \tag{10.5}$$

where $u_{j+1}$ is the first time-step of the optimized actuation starting at $x_j$. This
is illustrated in Fig. 10.3. For more details, see Kaiser et al. [366] and Fonzi et al.
[247].

It is possible to optimize highly customized cost functions, subject to nonlin- 
ear dynamics, with constraints on the actuation and state. However, the com- 
putational requirements of re-optimizing at each time-step are considerable, 
putting limits on the complexity of the model and optimization techniques. For- 
tunately, rapid advances in computing power and optimization are enabling 
MPC for real-time nonlinear control. 
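To make the receding-horizon loop concrete, the following is a minimal sketch for a scalar system, assuming SciPy's bounded L-BFGS-B optimizer. The surrogate model `F`, the weights, the horizons, and the simplified indexing of the cost are all illustrative assumptions rather than the formulation in (10.1) verbatim, and the plant is taken to be the surrogate itself.

```python
import numpy as np
from scipy.optimize import minimize

dt = 0.1

def F(x, u):
    """Hypothetical surrogate model: forward-Euler step of dx/dt = -x**3 + u."""
    return x + dt * (-x**3 + u)

def cost(useq, x0, u_prev, ref, mp, Q, R, Rd):
    """Tracking cost plus penalties on the input and on input changes."""
    J, x, ulast = 0.0, x0, u_prev
    for k in range(mp):
        uk = useq[min(k, len(useq) - 1)]   # hold last input beyond the control horizon
        x = F(x, uk)                        # predict with the surrogate
        J += Q * (x - ref)**2 + R * uk**2 + Rd * (uk - ulast)**2
        ulast = uk
    return J

def mpc_step(x0, u_prev, ref, mp=10, mc=5, umax=2.0, Q=1.0, R=0.001, Rd=0.01):
    """Optimize mc inputs over an mp-step prediction and return only the first."""
    res = minimize(cost, np.full(mc, u_prev),
                   args=(x0, u_prev, ref, mp, Q, R, Rd),
                   bounds=[(-umax, umax)] * mc, method="L-BFGS-B")
    return float(res.x[0])

# Closed loop: apply the first optimized input, then re-optimize at the next step
x, u, ref = 0.0, 0.0, 1.0
xs, us = [x], []
for j in range(60):
    u = mpc_step(x, u, ref)
    x = F(x, u)
    xs.append(x)
    us.append(u)
```

Each call to `mpc_step` re-solves the optimization from the current state, which is the source of the computational cost discussed above; warm-starting from the previous input is one common way to reduce it.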

There have been tremendous recent advances in MPC, especially deep MPC,
which uses deep learning for the surrogate model [438]. Deep MPC has been
used in a wide range of applications, including for vision-based driving
systems [212], controlling laser systems [58], fluid flows [84, 513], and aeroelastic
systems [247]. Tube MPC is a robust strategy that keeps the system within a
tube around the reference trajectory [242, 458], which is useful for safety-critical
systems with model error and disturbances. More generally, developing robust
and distributed MPC algorithms with guarantees is a major avenue
of research in autonomy and robotics. MPC may also be used for model-based
reinforcement learning [756], essentially codifying the MPC controller into a
reinforcement learning policy. Finally, recent years have seen efforts to combine
differentiable programming, which is a key enabler of modern neural network
training, with MPC, resulting in differentiable predictive control.

Figure 10.3: Illustration of model predictive control used to track a set point,
where the actuation input u is iteratively optimized over a receding horizon.
Reproduced with permission from Kaiser et al. [366].


10.2 Nonlinear System Identification for Control 


The data-driven modeling and control of complex systems is undergoing a rev- 
olution, driven by the rise of big data, advanced algorithms in machine learning 
and optimization, and modern computational hardware. Despite the increasing 
use of equation-free and adaptive control methods, there remains a wealth of 
powerful model-based control techniques, such as linear optimal control (see
Chapter 8) and model predictive control (MPC) [262]. Increasingly, these
model-based control strategies are aided by data-driven techniques that char- 
acterize the input-output dynamics of a system of interest from measurements 
alone, without relying on first-principles modeling. Broadly speaking, this is 
known as system identification, which has a long and rich history in control the- 
ory going back decades to the time of Kalman. However, with increasingly 
powerful data-driven techniques, such as those described in Chapter 7,
nonlinear system identification is the focus of renewed interest.

The goal of system identification is to identify a low-order model of the
input-output dynamics from actuation u to measurements y. If we are able
to measure the full state x of the system, then this reduces to identifying the
dynamics f that satisfy

$$\frac{d}{dt} x = f(x, u). \tag{10.6}$$

This problem may be formulated in discrete time, since data is typically
collected at discrete instants in time and control laws are often implemented
digitally. In this case, the dynamics read

$$x_{k+1} = F(x_k, u_k). \tag{10.7}$$

When the dynamics are approximately linear, we may identify a linear system

$$x_{k+1} = A x_k + B u_k, \tag{10.8}$$


which is the approach taken in the dynamic mode decomposition with control 
(DMDc) algorithm below. 

It may also be advantageous to identify a set of measurements y = g(x), in
which the unforced nonlinear dynamics appear linear:

$$y_{k+1} = A_Y y_k. \tag{10.9}$$

This is the approach taken in the Koopman control method below. In this way,
nonlinear dynamics may be estimated and controlled using standard textbook
linear control theory in the intrinsic coordinates y [365, 404].

Finally, the nonlinear dynamics in (10.6) or (10.7) may be identified directly
using the SINDy with control algorithm. The resulting models may be used
with model predictive control for the control of fully nonlinear systems [366].


DMD with Control 


Proctor et al. extended the DMD algorithm to include the effect of ac- 
tuation and control, in the so-called DMD with control (DMDc) algorithm. It 
was observed that naively applying DMD to data from a system with actuation 
would often result in incorrect dynamics, as the effects of internal dynamics are 
confused with the effects of actuation. DMDc was originally motivated by the 
problem of characterizing and controlling the spread of disease, where it is un- 
reasonable to stop intervention efforts (e.g., vaccinations) just to obtain a char- 
acterization of the unforced dynamics [568]. Instead, if the actuation signal is 
measured, a new DMD regression may be formulated in order to disambiguate
the effect of internal dynamics from that of actuation and control. Subsequently,
this approach has been extended to perform DMDc on heavily subsampled or
compressed measurements by Bai et al. [41].

The DMDc method seeks to identify the best-fit linear operators A and B
that approximately satisfy the following dynamics on measurement data:

$$\mathbf{x}_{k+1} \approx \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{u}_k. \tag{10.10}$$


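In addition to the snapshot matrix $\mathbf{X} = [\mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \cdots \;\; \mathbf{x}_m]$ and the time-shifted
snapshot matrix $\mathbf{X}' = [\mathbf{x}_2 \;\; \mathbf{x}_3 \;\; \cdots \;\; \mathbf{x}_{m+1}]$ from (7.24), a matrix Υ of the actuation
input history is assembled:

$$\boldsymbol{\Upsilon} = [\mathbf{u}_1 \;\; \mathbf{u}_2 \;\; \cdots \;\; \mathbf{u}_m]. \tag{10.11}$$

The dynamics in (10.10) may be written in terms of the data matrices:

$$\mathbf{X}' \approx \mathbf{A}\mathbf{X} + \mathbf{B}\boldsymbol{\Upsilon}. \tag{10.12}$$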


As in the DMD algorithm (see Section 7.2), the leading eigenvalues and 
eigenvectors of the best-fit linear operator A are obtained via dimensionality 
reduction and regression. If the actuation matrix B is known, then it is straight- 
forward to correct for the actuation and identify the spectral decomposition of 
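A by replacing $\mathbf{X}'$ with $\mathbf{X}' - \mathbf{B}\boldsymbol{\Upsilon}$ in the DMD algorithm:

$$(\mathbf{X}' - \mathbf{B}\boldsymbol{\Upsilon}) \approx \mathbf{A}\mathbf{X}. \tag{10.13}$$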
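The correction in (10.13) can be illustrated with a minimal NumPy sketch. The test system `A_true`, the actuation matrix `B`, and the random input history are illustrative assumptions, not values from the text, and a plain pseudo-inverse stands in for the reduced-order DMD regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test system (illustrative values only)
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])

m = 50                                  # number of snapshot pairs
x = np.zeros((2, m + 1))
x[:, 0] = [1.0, -1.0]
Ups = rng.standard_normal((1, m))       # measured actuation history
for k in range(m):
    x[:, k + 1] = A_true @ x[:, k] + B @ Ups[:, k]

X, Xp = x[:, :-1], x[:, 1:]             # snapshot matrix and time-shifted matrix

# Subtract the known actuation effect, then regress: (X' - B Ups) ~ A X
A_est = (Xp - B @ Ups) @ np.linalg.pinv(X)

# Naive DMD ignores the input and attributes its effect to the dynamics
A_naive = Xp @ np.linalg.pinv(X)
```

In this noise-free setting `A_est` should match `A_true` to numerical precision, while `A_naive` is contaminated by the actuation.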


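When B is unknown, both A and B must be simultaneously identified. In
this case, the dynamics in (10.12) may be recast as

$$\mathbf{X}' \approx \begin{bmatrix}\mathbf{A} & \mathbf{B}\end{bmatrix}\begin{bmatrix}\mathbf{X} \\ \boldsymbol{\Upsilon}\end{bmatrix} = \mathbf{G}\boldsymbol{\Omega}, \tag{10.14}$$

and the matrix $\mathbf{G} = [\mathbf{A} \;\; \mathbf{B}]$ is obtained via least-squares regression:

$$\mathbf{G} \approx \mathbf{X}'\boldsymbol{\Omega}^{\dagger}. \tag{10.15}$$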


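The matrix $\boldsymbol{\Omega} = [\mathbf{X}^* \;\; \boldsymbol{\Upsilon}^*]^*$ is generally a high-dimensional data matrix, which
may be approximated using the SVD:

$$\boldsymbol{\Omega} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*. \tag{10.16}$$

The matrix $\tilde{\mathbf{U}}$ must be split into two matrices, $\tilde{\mathbf{U}} = [\tilde{\mathbf{U}}_1^* \;\; \tilde{\mathbf{U}}_2^*]^*$, to provide bases
for X and Υ. Unlike the DMD algorithm, $\tilde{\mathbf{U}}$ provides a reduced basis for the
input space, while $\hat{\mathbf{U}}$ from

$$\mathbf{X}' = \hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^* \tag{10.17}$$

defines a reduced basis for the output space. It is then possible to approximate
$\mathbf{G} = [\mathbf{A} \;\; \mathbf{B}]$ by projecting onto this basis:

$$\tilde{\mathbf{G}} = \hat{\mathbf{U}}^*\mathbf{G}\begin{bmatrix}\hat{\mathbf{U}} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}\end{bmatrix}. \tag{10.18}$$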


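The resulting projected matrices $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$ in $\tilde{\mathbf{G}}$ are

$$\tilde{\mathbf{A}} = \hat{\mathbf{U}}^*\mathbf{A}\hat{\mathbf{U}} = \hat{\mathbf{U}}^*\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}_1^*\hat{\mathbf{U}}, \tag{10.19a}$$
$$\tilde{\mathbf{B}} = \hat{\mathbf{U}}^*\mathbf{B} = \hat{\mathbf{U}}^*\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}_2^*. \tag{10.19b}$$

More importantly, it is possible to recover the DMD eigenvectors Φ from the
eigendecomposition $\tilde{\mathbf{A}}\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$:

$$\boldsymbol{\Phi} = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}_1^*\hat{\mathbf{U}}\mathbf{W}. \tag{10.20}$$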
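The steps in (10.14) through (10.20) can be collected into a short NumPy sketch. The function name `dmdc`, the truncation ranks `r_in` and `r_out`, and the synthetic three-state test system are assumptions made for illustration; this is one possible arrangement of the computation, not the authors' reference code.

```python
import numpy as np

def dmdc(X, Xp, Ups, r_in, r_out):
    """DMDc sketch following (10.14)-(10.20); r_in and r_out are SVD truncation ranks."""
    n = X.shape[0]
    Omega = np.vstack([X, Ups])                     # stacked state and input data

    # Truncated SVD of the input space: Omega ~ U~ S~ V~*
    Ut, St, Vth = np.linalg.svd(Omega, full_matrices=False)
    Ut, St, Vt = Ut[:, :r_in], St[:r_in], Vth[:r_in].conj().T
    U1, U2 = Ut[:n], Ut[n:]                         # blocks acting on X and Ups

    # Truncated SVD of the output space: X' ~ U^ S^ V^*
    Uh = np.linalg.svd(Xp, full_matrices=False)[0][:, :r_out]

    M = Xp @ Vt / St                                # X' V~ S~^{-1}
    A_t = Uh.conj().T @ M @ U1.conj().T @ Uh        # (10.19a)
    B_t = Uh.conj().T @ M @ U2.conj().T             # (10.19b)

    lam, W = np.linalg.eig(A_t)                     # eigendecomposition of A~
    Phi = M @ U1.conj().T @ Uh @ W                  # (10.20)
    return A_t, B_t, lam, Phi, Uh

# Hypothetical test system (illustrative values only)
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1, 0.0],
                   [0.0, 0.8, 0.2],
                   [0.0, 0.0, 0.5]])
B_true = np.array([[0.0], [0.0], [1.0]])
m = 60
x = np.zeros((3, m + 1))
x[:, 0] = rng.standard_normal(3)
Ups = rng.standard_normal((1, m))
for k in range(m):
    x[:, k + 1] = A_true @ x[:, k] + B_true @ Ups[:, k]

A_t, B_t, lam, Phi, Uh = dmdc(x[:, :-1], x[:, 1:], Ups, r_in=4, r_out=3)
A_est = Uh @ A_t @ Uh.conj().T                      # lift back to the full state
B_est = Uh @ B_t
```

With full-rank truncation and noise-free data, the lifted operators should reproduce `A_true` and `B_true`; in practice the ranks would be chosen smaller than the state dimension.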


Ambiguity in Identifying Closed-Loop Systems 


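For systems that are being actively controlled via feedback, with $\mathbf{u} = -\mathbf{K}\mathbf{x}$,

$$\mathbf{x}_{k+1} = \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{u}_k \tag{10.21a}$$
$$= \mathbf{A}\mathbf{x}_k - \mathbf{B}\mathbf{K}\mathbf{x}_k \tag{10.21b}$$
$$= (\mathbf{A} - \mathbf{B}\mathbf{K})\mathbf{x}_k, \tag{10.21c}$$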


it is impossible to disambiguate the dynamics A and the actuation BK. In this 
case, it is important to add perturbations to the actuation signal u to provide 
additional information. These perturbations may be a white noise process or 
occasional impulses that provide a kick to the system, providing a signal to 
disambiguate the dynamics from the feedback signal. 
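The ambiguity can be seen numerically as a rank deficiency of the stacked data matrix of states and inputs. The sketch below assumes a small hypothetical system and feedback gain (`A`, `B`, `K` are illustrative), and compares pure feedback data with data that includes a small white-noise perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system and feedback gain (illustrative values only)
A = np.array([[0.9, 0.2],
              [0.0, 0.7]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[0.1, 0.3]])

def collect(noise_std, m=30):
    """Simulate x_{k+1} = A x_k + B u_k with u_k = -K x_k + d_k."""
    x = np.zeros((2, m + 1))
    u = np.zeros((1, m))
    x[:, 0] = [1.0, 1.0]
    for k in range(m):
        d = noise_std * rng.standard_normal(1)   # perturbation added to the input
        u[:, k] = -K @ x[:, k] + d
        x[:, k + 1] = A @ x[:, k] + B @ u[:, k]
    return np.vstack([x[:, :-1], u])             # stacked states and inputs

rank_feedback_only = np.linalg.matrix_rank(collect(noise_std=0.0))
rank_with_kicks = np.linalg.matrix_rank(collect(noise_std=0.1))
```

Without the perturbation, the input row is a linear combination of the state rows, so the regression for A and B is not unique; the added noise restores full row rank.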


Koopman Operator Nonlinear Control 


For nonlinear systems, it may be advantageous to identify data-driven coordi- 
nate transformations that make the dynamics appear linear. These coordinate 
transformations are related to intrinsic coordinates defined by eigenfunctions 
of the Koopman operator (see the earlier discussion of Koopman theory). Koopman
analysis has thus been leveraged for nonlinear estimation [679, 680] and control
[365, 404, 558].

It is possible to design estimators and controllers directly from DMD or 
eDMD models, and Korda et al. used model predictive control (MPC) 
to control nonlinear systems with eDMD models. MPC performance is also 
surprisingly good for DMD models, as shown in Kaiser et al. [366]. In addition,
Peitz et al. demonstrated the use of MPC for switching control between a small
number of actuation values to track a reference value of lift in an unsteady
fluid flow; for each constant actuation value, a separate eDMD model was
characterized. Surana, as well as Surana and Banaszuk, have also demonstrated
excellent nonlinear estimators based on Koopman Kalman
filters. However, as discussed previously, eDMD models may contain many
spurious eigenvalues and eigenvectors because of closure issues related to find- 
ing a Koopman-invariant subspace. Instead, it may be advantageous to identify 
a handful of relevant Koopman eigenfunctions and perform control directly in 
these coordinates [365]. 

Previously, we described several strategies to approximate Koopman
eigenfunctions, $\varphi(\mathbf{x})$, where the dynamics become linear:

$$\frac{d}{dt}\varphi(\mathbf{x}) = \lambda\varphi(\mathbf{x}). \tag{10.22}$$


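In Kaiser et al. [365], the Koopman eigenfunction equation was extended for
control-affine nonlinear systems:

$$\frac{d}{dt}\mathbf{x} = \mathbf{f}(\mathbf{x}) + \mathbf{B}\mathbf{u}. \tag{10.23}$$

For these systems, it is possible to apply the chain rule to $d\varphi(\mathbf{x})/dt$, yielding

$$\frac{d}{dt}\varphi(\mathbf{x}) = \nabla\varphi(\mathbf{x}) \cdot (\mathbf{f}(\mathbf{x}) + \mathbf{B}\mathbf{u}) \tag{10.24a}$$
$$= \lambda\varphi(\mathbf{x}) + \nabla\varphi(\mathbf{x}) \cdot \mathbf{B}\mathbf{u}. \tag{10.24b}$$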


Note that, even with actuation, the dynamics of Koopman eigenfunctions remain
linear, and the effect of actuation is still additive. However, now the
actuation mode $\nabla\varphi(\mathbf{x}) \cdot \mathbf{B}$ may be state-dependent. In fact, the actuation will be
state-dependent unless the directional derivative of the eigenfunction is
constant in the B direction. Fortunately, there are many powerful generalizations
of standard Riccati-based linear control theory (e.g., LQR, Kalman filters, etc.)
for systems with a state-dependent Riccati equation.
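A small numerical check makes the identity in (10.24) concrete. The sketch assumes a standard two-state example with a known eigenfunction, namely dx1/dt = mu*x1 and dx2/dt = lam*(x2 - x1^2), for which phi(x) = x2 - b*x1^2 with b = lam/(lam - 2*mu) is a Koopman eigenfunction with eigenvalue lam. The two actuation directions `B_a` and `B_b` are illustrative choices.

```python
import numpy as np

mu, lam = -0.1, -1.0
b = lam / (lam - 2 * mu)            # coefficient that makes phi an eigenfunction

def f(x):                           # unforced dynamics (assumed example system)
    return np.array([mu * x[0], lam * (x[1] - x[0] ** 2)])

def phi(x):                         # Koopman eigenfunction with eigenvalue lam
    return x[1] - b * x[0] ** 2

def grad_phi(x):                    # gradient of phi with respect to the state
    return np.array([-2 * b * x[0], 1.0])

rng = np.random.default_rng(0)
B_a = np.array([0.0, 1.0])          # input enters x2: grad_phi . B = 1 (constant)
B_b = np.array([1.0, 0.0])          # input enters x1: grad_phi . B = -2 b x1

max_residual = 0.0
for _ in range(100):
    x, u = rng.standard_normal(2), rng.standard_normal()
    for B in (B_a, B_b):
        lhs = grad_phi(x) @ (f(x) + B * u)              # (10.24a)
        rhs = lam * phi(x) + (grad_phi(x) @ B) * u      # (10.24b)
        max_residual = max(max_residual, abs(lhs - rhs))
```

For `B_a` the actuation mode is constant, so the eigenfunction dynamics form a linear system with constant input gain; for `B_b` the gain depends on x1, which is the state-dependent case discussed above.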


SINDy with Control 


Although it is appealing to identify intrinsic coordinates along which nonlin- 
ear dynamics appear linear, these coordinates are challenging to discover, even 
for relatively simple systems. Instead, it may be beneficial to directly identify 
the nonlinear actuated dynamical system in (10.6) or (10.7), for use with
standard model-based control. Using the sparse identification of nonlinear
dynamics (SINDy) method (see Section 7.3) results in computationally efficient
models that may be used in real time with model predictive control [366].
Moreover, these models may be identified from relatively small amounts of
training data, compared with neural networks and other leading machine learning
methods, so that they may even be characterized online and in response to
abrupt changes to the system dynamics.

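The SINDy algorithm is readily extended to include the effects of actuation
[366]. In addition to collecting measurements of the state snapshots x in
the matrix X, actuation inputs u are collected in the matrix Υ from (10.11), as
in DMDc. Next, an augmented library of candidate right-hand side functions
$\boldsymbol{\Theta}([\mathbf{X} \;\; \boldsymbol{\Upsilon}])$ is constructed:

$$\boldsymbol{\Theta}([\mathbf{X} \;\; \boldsymbol{\Upsilon}]) = \begin{bmatrix}\mathbf{1} & \mathbf{X} & \boldsymbol{\Upsilon} & \mathbf{X}^2 & \mathbf{X} \otimes \boldsymbol{\Upsilon} & \boldsymbol{\Upsilon}^2 & \cdots\end{bmatrix}. \tag{10.25}$$

Here, $\mathbf{X} \otimes \boldsymbol{\Upsilon}$ denotes quadratic cross-terms between the state x and the
actuation u, evaluated on the data.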

In SINDy with control (SINDYc), the same sparse regression is used to de- 
termine the fewest active terms in the library required to describe the observed 
dynamics. As in DMDc, if the system is being actively controlled via feedback
u = K(x), then it is impossible to disambiguate the internal dynamics from
the actuation, unless an additional perturbation signal is added to the actuation
to provide additional information.
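The augmented library and sparse regression can be sketched in a few lines of NumPy. Everything specific here is an assumption for illustration: the scalar toy dynamics, the helper names `library` and `stlsq`, the threshold value, and the use of exact derivatives at randomly sampled points rather than derivatives estimated from a measured trajectory.

```python
import numpy as np

def library(x, u):
    """Candidate functions [1, x, u, x^2, x*u, u^2] for a scalar state and input."""
    return np.column_stack([np.ones_like(x), x, u, x**2, x * u, u**2])

def stlsq(Theta, dX, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares (sparse regression)."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0                              # prune small coefficients
        for j in range(dX.shape[1]):
            big = ~small[:, j]
            if big.any():                            # refit the remaining terms
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dX[:, j], rcond=None)[0]
    return Xi

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)                          # sampled states
u = rng.uniform(-2, 2, 200)                          # sampled actuation values
dx = -2.0 * x + 0.5 * x * u + u                      # hypothetical dynamics dx/dt = f(x, u)

Theta = library(x, u)
Xi = stlsq(Theta, dx[:, None])                       # one column per state variable
```

The nonzero entries of `Xi` should pick out the x, u, and x*u columns. Because the inputs here are sampled independently of the state, the closed-loop ambiguity described above does not arise in this toy setting.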


Model Predictive Control Example 


In this example, we will use SINDYc to identify a model of the forced Lorenz 
equations from data and then control this model using MPC. The basic code 
is the same as SINDy, except that the actuation is included as a variable when 
building the library Θ.

We test the SINDYc model identification on the forced Lorenz equations:

$$\dot{x} = \sigma(y - x) + g(u), \tag{10.26a}$$
$$\dot{y} = x(\rho - z) - y, \tag{10.26b}$$
$$\dot{z} = xy - \beta z. \tag{10.26c}$$


In this example, we train a model using 20 time units of controlled data, and 
validate it on another 20 time units, where we switch the forcing to a periodic 
signal u(t) = 50sin(10t). The SINDy algorithm does not capture the effect of 
actuation, while SINDYc correctly identifies the forced model and predicts the 
behavior in response to a new actuation that was not used in the training data, 
as shown in Fig. 10.4.

Finally, SINDYc and neural network models of Lorenz are both used to design
model predictive controllers, as shown in Fig. 10.5. Both methods identify
accurate models that capture the dynamics, although the SINDYc procedure
requires less data, identifies models more rapidly, and is more robust to noise
than the neural network model. This added efficiency and robustness are due
to the sparsity-promoting optimization, which regularizes the model
identification problem. In addition, identifying a sparse model requires less data.

Figure 10.4: SINDy and SINDYc predictions for the controlled Lorenz system
in (10.26). Training data consists of the Lorenz system with state feedback. For
the training period, the input is u(t) = 26 − x(t) + d(t) with a Gaussian
disturbance d. Afterward the input u switches to a periodic signal u(t) = 50 sin(10t).
Reproduced with permission from [133].


10.3 Machine Learning Control 


Machine learning is a rapidly developing field that is transforming our ability
to describe complex systems from observational data, rather than first-principles
modeling [518]. Until recently, these methods have largely been
developed for static data, although there is a growing emphasis on using
machine learning to characterize dynamical systems. The use of machine learning
to learn control laws (i.e., to determine an effective map from sensor outputs to
actuation inputs) is even more recent [246]. As machine learning encompasses a
broad range of high-dimensional, possibly nonlinear, optimization techniques,
it is natural to apply machine learning to the control of complex, nonlinear
systems. Specific machine learning methods for control include adaptive neural
networks, genetic algorithms, genetic programming, and reinforcement
learning. A general machine learning control architecture is shown in Fig. 10.6. Many
of these machine learning algorithms are based on biological principles, such
as neural networks, reinforcement learning, and evolutionary algorithms.

Figure 10.5: Model predictive control of the Lorenz system with a neural
network model and a SINDy model. Reproduced with permission from Kaiser et
al. [366].

It is important to note that model-free control methodologies may be ap- 
plied to numerical or experimental systems with little modification. All of these 
model-free methods have some sort of macroscopic objective function, typi- 
cally based on sensor measurements (past and present). Some challenging real- 
world example objectives in different disciplines include the following. 


(a) Fluid dynamics: In aerodynamic applications, the goal is often some com- 
bination of drag reduction, lift increase, and noise reduction; while in 
pharmaceutical and chemical engineering applications, the goal may in- 
volve mixing enhancement. 


(b) Finance: The goal is often to maximize profit at a given level of risk
tolerance, subject to the law.

(c) Epidemiology: The goal may be to effectively suppress a disease with
constraints of sensing (e.g., blood samples, clinics, etc.) and actuation
(e.g., vaccines, bed nets, etc.).

(d) Industry: The goal of increasing productivity must be balanced with
several constraints, including labor and work safety laws, as well as
environmental impact, which often have significant uncertainty.

(e) Autonomy and robotics: The goal of self-driving cars and autonomous
robots is to achieve a task while interacting safely with a complex
environment, including cooperating with human agents.

Figure 10.6: Schematic of machine learning control wrapped around a complex
system using noisy sensor-based feedback. The control objective is to minimize
a well-defined cost function J within the space of possible control laws. An
offline learning loop provides experiential data to train the controller. Genetic
programming provides a particularly flexible algorithm to search out effective
control laws. The vector z contains information that may factor into the cost.


In the examples above, the objectives involve some minimization or maxi- 
mization of a given quantity subject to some constraints. These constraints may 
be hard, as in the case of disease suppression on a fixed budget, or they may in- 
volve a complex multi-objective tradeoff. Often, constrained optimizations will 
result in solutions that live at the boundary of the constraint, which may ex- 
plain why many companies operate at the fringe of legality. In all of the cases, 
the optimization must be performed with respect to the underlying dynamics 
of the system: fluids are governed by the Navier-Stokes equations, finance is 
governed by human behavior and economics, and disease spread is the result 
of a complex interaction of biology, human behavior, and geography. 

These real-world control problems are extremely challenging, for a number
of reasons. They are high-dimensional and strongly nonlinear, often with
millions or billions of degrees of freedom that evolve according to possibly
unknown nonlinear interactions. In addition, it may be exceedingly expensive or
infeasible to run different scenarios for system identification; for example, there 
are serious ethical issues associated with testing different vaccination strategies 
when human lives are at stake. 

Increasingly, challenging optimization problems are being solved with ma- 
chine learning, leveraging the availability of vast and increasing quantities of 
data. Many of the recent successes have been on static data (e.g., image classifi- 
cation, speech recognition, etc.), and marketing tasks (e.g., online sales and ad 
placement). However, current efforts are applying machine learning to analyze 
and control complex systems with dynamics, with the potential to revolution- 
ize our ability to interact with and manipulate these systems. 

The following sections describe a handful of powerful learning techniques 
that are being widely applied to control complex systems where models may 
be unavailable. Note that the relative importance of the following methods is
not proportional to the amount of space dedicated to them here.


Reinforcement Learning 


Reinforcement learning (RL) is an important discipline at the intersection of 
machine learning and control [683], and it is currently being used heavily by 
companies for generalized artificial intelligence, autonomous robots, and self- 
driving cars. In reinforcement learning, a control policy is refined over time, 
with improved performance achieved through experience. Because this is such 
an important research area, it is the topic of Chapter 11.


Iterative Learning Control 


Iterative learning control (ILC) is a widely used tech- 
nique that learns how to refine and optimize repetitive control tasks, such as 
the motion of a robot arm on a manufacturing line, where the robot arm will 
be repeating the same motion thousands of times. In contrast to the feedback 
control methods from Chapter 8, which adjust the actuation signal in real time
based on measurements, ILC refines the entire open-loop actuation sequence 
after each iteration of a prescribed task. The refinement process may be as sim- 
ple as a proportional correction based on the measured error, or may involve 
a more sophisticated update rule. Iterative learning control does not require 
one to know the system equations and has performance guarantees for linear 
systems. ILC is therefore a mainstay in industrial control for repetitive tasks 
in a well-controlled environment, such as trajectory control of a robot arm or 
printer-head control in additive manufacturing.

Figure 10.7: Depiction of parameter cube for PID control. The genetic algorithm
represents a given parameter value as a genetic sequence that concatenates the
various parameters. In this example, the parameters are expressed in a binary
representation that is scaled so that 000 is the lower bound and 111 is the
upper bound. Color indicates the cost associated with each parameter value.
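Returning to iterative learning control, the proportional correction described above can be sketched on a simple repetitive task. The first-order plant, the sinusoidal reference, and the learning gain of 0.8 are assumptions chosen so that the iteration contracts; the update rule uses only the measured error from each trial, not the plant parameters.

```python
import numpy as np

def run_trial(u, a=0.5, b=1.0):
    """One pass of the task on a hypothetical plant y[k+1] = a*y[k] + b*u[k], y[0] = 0."""
    y = np.zeros(len(u) + 1)
    for k in range(len(u)):
        y[k + 1] = a * y[k] + b * u[k]
    return y[1:]                       # outputs influenced by u[0], ..., u[N-1]

N = 50
ref = np.sin(np.linspace(0.0, 2.0 * np.pi, N))   # reference trajectory for the task
u = np.zeros(N)                        # initial open-loop actuation sequence
gain = 0.8                             # learning gain (assumed)

error_norms = []
for trial in range(30):
    e = ref - run_trial(u)             # tracking error measured during this trial
    error_norms.append(np.linalg.norm(e))
    u = u + gain * e                   # proportional correction of the whole sequence
```

Each trial refines the entire open-loop sequence `u`, and for this choice of plant and gain the error norm shrinks from one trial to the next.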


Genetic Algorithms 


The genetic algorithm (GA) is one of the earliest and simplest algorithms for pa- 
rameter optimization, based on the biological principle of optimization through 
natural selection and fitness [333]. GA is frequently used to tune and
adapt the parameters of a controller. In GA, a population comprising many
system realizations with different parameter values competes to minimize a given
cost function, and successful parameter values are propagated to future
generations through a set of genetic rules. The parameters of a system are generally
represented by a binary sequence, as shown in Fig. 10.7 for a PID control system
with three parameters, given by the three control gains $K_P$, $K_I$, and $K_D$. Next,
a number of realizations with different parameter values, called individuals, are 
initialized in a population and their performance is evaluated and compared 
on a given well-defined task. Successful individuals with a lower cost have a 
higher probability of being selected to advance to the next generation, accord- 
ing to the following genetic operations. 


(a) Elitism (optional): A set number of the most fit individuals with the best 
performance are advanced directly to the next generation. 


(b) Replication: An individual is selected to advance to the next generation. 


(c) Crossover: Two individuals are selected to exchange a portion of their 
code and then advance to the next generation; crossover serves to exploit 
and enhance existing successful strategies. 


(d) Mutation: An individual is selected to have a portion of its code modified
with new values; mutation promotes diversity and serves to increase the
exploration of parameter space.

Figure 10.8: Schematic illustrating evolution in a genetic algorithm. The
individuals in generation k are each evaluated and ranked in ascending order
based on their cost function, which is inversely proportional to their
probability of selection for genetic operations. Then, individuals are chosen based on
this weighted probability for advancement to generation k + 1 using the four
operations: elitism, replication, crossover, and mutation. This forms generation
k + 1, and the sequence is repeated until the population statistics converge or
another suitable stopping criterion is reached.


For the replication, crossover, and mutation operations, individuals are
randomly selected to advance to the next generation with the probability of
selection increasing with fitness. The genetic operations are illustrated for the PID
control example in Fig. 10.8. These generations are evolved until the fitness of
the top individuals converges or other stopping criteria are met.

Genetic algorithms are generally used to find nearly globally optimal pa- 
rameter values, as they are capable of exploring and exploiting local wells in the 
cost function. GA provides a middle ground between a brute-force search and a 
convex optimization, and is an alternative to expensive Monte Carlo sampling, 
which does not scale to high-dimensional parameter spaces. However, there is 
no guarantee that genetic algorithms will converge to a globally optimal solu- 
tion. There are also a number of hyperparameters that may affect performance, 
including the size of the populations, number of generations, and relative se- 
lection rates of the various genetic operations. 
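The genetic operations above can be sketched as a minimal binary-encoded GA. The cost function, the 8-bit encoding, the rank-based selection weights, and the operation rates (0.1 replication, 0.7 crossover, 0.2 mutation) are illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS, N_PARAMS = 8, 2                 # bits per parameter, number of parameters
LO, HI = 0.0, 1.0                       # bounds: all zeros -> LO, all ones -> HI

def decode(genome):
    """Map a concatenated binary genome to real parameter values."""
    bits = genome.reshape(N_PARAMS, N_BITS)
    ints = bits @ (2 ** np.arange(N_BITS)[::-1])
    return LO + (HI - LO) * ints / (2 ** N_BITS - 1)

def cost(p):                            # hypothetical cost with minimum at (0.3, 0.7)
    return (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2

def next_generation(pop, n_elite=2, p_rep=0.1, p_cross=0.7):
    costs = np.array([cost(decode(g)) for g in pop])
    pop = pop[np.argsort(costs)]                        # rank individuals, best first
    weights = np.arange(len(pop), 0, -1, dtype=float)   # lower cost -> higher weight
    weights /= weights.sum()

    def pick():
        return pop[rng.choice(len(pop), p=weights)].copy()

    new = [g.copy() for g in pop[:n_elite]]             # elitism
    while len(new) < len(pop):
        r = rng.random()
        if r < p_rep:                                   # replication
            new.append(pick())
        elif r < p_rep + p_cross:                       # crossover at a single point
            a, b = pick(), pick()
            cut = rng.integers(1, a.size)
            new.append(np.concatenate([a[:cut], b[cut:]]))
            new.append(np.concatenate([b[:cut], a[cut:]]))
        else:                                           # mutation: flip one bit
            g = pick()
            g[rng.integers(g.size)] ^= 1
            new.append(g)
    return np.array(new[:len(pop)])

pop = rng.integers(0, 2, size=(30, N_BITS * N_PARAMS))  # random initial population
best_costs = []
for generation in range(40):
    best_costs.append(min(cost(decode(g)) for g in pop))
    pop = next_generation(pop)
best_costs.append(min(cost(decode(g)) for g in pop))
best_params = decode(min(pop, key=lambda g: cost(decode(g))))
```

Because of elitism, the best cost never increases from one generation to the next, although, as noted above, convergence to the global optimum is not guaranteed.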

Genetic algorithms have been widely used for optimization and control in 
nonlinear systems [246]. For example, GA was used for parameter tuning in 
open-loop control [516], with applications in jet mixing [406], combustion
processes [137], wake control [259, 567], and drag reduction [270]. GA has also been
employed to tune an $H_\infty$ controller in a combustion experiment [312].

Figure 10.9: Illustration of the function tree used to represent the control law u
in genetic programming control.


Genetic Programming 


Genetic programming (GP) is a powerful generalization of genetic 
algorithms that simultaneously optimizes both the structure and parameters 
of an input-output map. Recently, genetic programming has also been used to 
obtain control laws that map sensor outputs to actuation inputs, as shown in
Fig. 10.9. The function tree representation in GP is quite flexible, enabling the
encoding of complex functions of the sensor signal y through a recursive tree
structure. Each branch is a signal, and the merging points are mathematical
operations. Sensors and constants are the leaves, and the overall control signal
u is the root. The genetic operations of crossover, mutation, and replication are
shown schematically in Fig. 10.10. This framework is readily generalized to
include delay coordinates and temporal filters, as discussed in Duriez et al.
[225].
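The function tree representation can be sketched with nested tuples and a recursive evaluator. The operation set, the sensor names `y1` and `y2`, and the example control law are hypothetical and chosen only to show the structure.

```python
import math

# Internal nodes are (operation, child, ...); leaves are sensor names or constants
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "tanh": math.tanh,
}

def evaluate(node, sensors):
    """Recursively evaluate a function tree; the root value is the control signal u."""
    if isinstance(node, tuple):
        op, *children = node
        return OPS[op](*(evaluate(c, sensors) for c in children))
    if isinstance(node, str):
        return sensors[node]            # sensor leaf
    return float(node)                  # constant leaf

# Hypothetical control law u = 2*y1 + tanh(y1*y2)
tree = ("add", ("mul", 2.0, "y1"), ("tanh", ("mul", "y1", "y2")))
u = evaluate(tree, {"y1": 0.5, "y2": -1.0})
```

In this encoding, crossover amounts to swapping subtrees between two individuals, and mutation amounts to replacing a subtree with a newly generated one.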

Genetic programming has recently been used with impressive results in
turbulence control experiments, led by Bernd Noack and collaborators
[547]. This provides a new paradigm of control for strongly
nonlinear systems, where it is now possible to identify the structure of
nonlinear control laws. Genetic programming control is particularly well suited to
experiments where it is possible to rapidly evaluate a given control law,
enabling the testing of hundreds or thousands of individuals in a short amount
of time. Current demonstrations of genetic programming control in turbulence
have produced several macroscopic behaviors, such as drag reduction and
mixing enhancement, in an array of flow configurations. Specific flows include the
mixing layer [226, 547], the backward-facing step [227], and a
turbulent separated boundary layer [227].

Figure 10.10: Genetic operations used to advance function trees across
generations in genetic programming control. The relative selection rates of replication,
crossover, and mutation are p(R) = 0.1, p(C) = 0.7, and p(M) = 0.2,
respectively.

Figure 10.11: Proportional-integral-derivative (PID) control schematic. PID
remains ubiquitous in industrial control.


Example: Genetic Algorithm to Tune PID Control 


In this example, we will use the genetic algorithm to tune a
proportional-integral-derivative (PID) controller. However, it should be noted that this is
just a simple demonstration of evolutionary algorithms, and such heavy ma- 
chinery is not recommended to tune a PID controller in practice, as there are 
far simpler techniques. 

PID control is among the simplest and most widely used control architec- 
tures in industrial control systems, including for motor position and velocity 
control, for tuning of various subsystems in an automobile, and for the pres- 
sure and temperature controls in modern espresso machines, to name only a 
few of the myriad applications. As its name suggests, PID control additively 
combines three terms to form the actuation signal, based on the error signal 
and its integral and derivative in time. A schematic of PID control is shown in
Fig. 10.11.

In the earlier cruise control example, we saw that it was possible
to reduce reference tracking error by increasing the proportional control gain
$K_P$ in the control law $u = K_P(w_r - y)$. However, increasing the gain may
eventually cause instability in some systems, and it will not completely
eliminate the steady-state tracking error. The addition of an integral control term,
$K_I \int_0^t (w_r - y)\,d\tau$, is useful to eliminate steady-state reference tracking error while
alleviating the work required by the proportional term.
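The effect of the integral term can be seen in a short simulation. The sketch assumes a hypothetical first-order plant dy/dt = -y + u, the error convention e = w_r - y with u = K_P e + K_I times the integral of e, forward-Euler time stepping, and illustrative gains.

```python
def simulate(Kp, Ki, T=30.0, dt=0.01, w_r=1.0):
    """Forward-Euler simulation of the assumed plant dy/dt = -y + u under PI control."""
    y, integral = 0.0, 0.0
    for _ in range(int(T / dt)):
        e = w_r - y                     # tracking error
        integral += e * dt              # running integral of the error
        u = Kp * e + Ki * integral      # proportional plus integral action
        y += dt * (-y + u)              # plant update
    return w_r - y                      # final tracking error

err_P = simulate(Kp=4.0, Ki=0.0)        # proportional control only
err_PI = simulate(Kp=4.0, Ki=2.0)       # proportional plus integral control
```

For this plant, proportional control alone settles at an error of 1/(1 + K_P), which is 0.2 for K_P = 4, while adding the integral term drives the steady-state error toward zero.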

There are formal rules for how to choose the PID gains for various design
specifications, such as fast response and minimal overshoot and ringing. In this
example, we explore the use of a genetic algorithm to find effective PID gains
to minimize a cost function. We use an LQR cost function

$$J = \int_0^T Q(w_r - y)^2 + Ru^2 \, d\tau,$$

with Q = 1 and R = 0.001 for a step response $w_r = 1$. The system to be
controlled will be given by the transfer function

$$G(s) = \frac{1}{s^4 + s}.$$

The first step is to write a function that evaluates a given PID controller, as
in Code 10.1. The three PID gains are stored in the variable parms.


Code 10.1: [MATLAB] Evaluate cost function for PID controller.
function J = pidtest(G,dt,parms)

s = tf('s');
K = parms(1) + parms(2)/s + parms(3)*s/(1+.001*s);  % PID with filtered derivative
Loop = series(K,G);
ClosedLoop = feedback(Loop,1);
t = 0:dt:20;
[y,t] = step(ClosedLoop,t);  % unit step response

CTRLtf = K/(1+K*G);
u = lsim(CTRLtf,1-y,t);  % actuation signal

Q = 1;
R = .001;
J = dt*sum(Q*(1-y(:)).^2 + R*u(:).^2)


Code 10.1: [Python] Evaluate cost function for PID controller.
import numpy as np
from control.matlab import tf, series, feedback, step, lsim

def pidtest(G,dt,parms):
    s = tf('s')
    K = parms[0] + parms[1]/s + parms[2]*s/(1+0.001*s)  # PID with filtered derivative
    Loop = series(K,G)
    ClosedLoop = feedback(Loop,1)
    t = np.arange(0,20,dt)
    y,t = step(ClosedLoop,t)  # unit step response

    CTRLtf = K/(1+K*G)
    u = lsim(CTRLtf,1-y,t)[0]  # actuation signal

    Q = 1
    R = 0.001
    J = dt*np.sum(np.power(Q*(1-y.reshape(-1)),2) + R*np.power(u.reshape(-1),2))
    return J

Figure 10.12: Cost function across generations, as GA optimizes PID gains.


Next, it is relatively simple to use a genetic algorithm to optimize the PID
control gains, as in Code 10.2. In this example, we run the GA for 10
generations, with a population size of 25 individuals per generation.


Code 10.2: [MATLAB] Genetic algorithm to tune PID controller.

dt = 0.001;
PopSize = 25;
MaxGenerations = 10;
s = tf('s');
G = 1/(s*(s*s+s+1));

options = optimoptions(@ga,'PopulationSize',PopSize,'MaxGenerations',MaxGenerations,'OutputFcn',@myfun);
[x,fval] = ga(@(K)pidtest(G,dt,K),3,-eye(3),zeros(3,1),[],[],[],[],[],options);


It is also possible to reproduce this example in Python using the distributed
evolutionary algorithms in Python (DEAP) package, available on GitHub at
https://github.com/. Python code is available on the book's GitHub, although it
is too long to reproduce here, as there is no simple one-line ga command, as in
MATLAB.

The results from intermediate generations may be saved using a custom 
output function, as described in the myfun.m code on the book’s GitHub. 

The evolution of the cost function across various generations is shown in
Fig. 10.12. As the generations progress, the cost function steadily decreases.

Figure 10.13: PID gains generated from genetic algorithm. Red points
correspond to early generations while blue points correspond to later generations.
The black point is the best individual found by GA.

Figure 10.14: PID controller response from first generation of genetic algorithm.


The individual gains are shown in Fig. 10.13, with redder dots corresponding
to early generations and bluer dots corresponding to later generations. As the 
genetic algorithm progresses, the PID gains begin to cluster around the optimal 
solution (black circle). 

Figure 10.14 shows the output in response to the PID controllers from the
first generation. It is clear from this plot that many of the controllers fail to
stabilize the system, resulting in large deviations in y. In contrast, Fig. 10.15
shows the output in response to the PID controllers from the last generation.
Overall, these controllers are more effective at producing a stable step response.

Figure 10.15: PID controller response from last generation of genetic algorithm.

The best controllers from each generation are shown in Fig. 10.16. In this
plot, the controllers from early generations are redder, while the controllers 
from later generations are bluer. As the GA progresses, the controller is able to 
minimize output oscillations and achieves fast rise-time. 


10.4 Adaptive Extremum-Seeking Control 


Although there are many powerful techniques for model-based control design, 
there are also a number of drawbacks. First, in many systems, there may not be 
access to a model, or the model may not be suitable for control (i.e., there may 
be strong nonlinearities or the model may be represented in a non-traditional 
form). Next, even after an attractor has been identified and the dynamics char- 
acterized, control may invalidate this model by modifying the attractor, giving 
rise to new and uncharacterized dynamics. The obvious exception is stabilizing 
a fixed point or a periodic orbit, in which case effective control keeps the system 
in a neighborhood where the linearized model remains accurate. Finally, there 
may be slow changes to the system that modify the underlying dynamics, and 
it may be difficult to measure and model these effects. 

The field of adaptive control broadly addresses these challenges, by allowing
the control law the flexibility to modify its action based on the changing
dynamics of a system. Extremum-seeking control (ESC) is a particularly
attractive form of adaptive control for complex systems because it does
not rely on an underlying model and it has guaranteed convergence and
stability under a set of well-defined conditions. Extremum-seeking may be used to
track local maxima of an objective function, despite disturbances, varying
system parameters, and nonlinearities. Adaptive control may be implemented for
in-time control or used for slow tuning of parameters in a working controller.

Figure 10.16: Best PID controllers from each generation. Red trajectories are
from early generations, and blue trajectories correspond to the last generation.

Extremum-seeking control may be thought of as an advanced perturb-and- 
observe method, whereby a sinusoidal perturbation is additively injected in the 
actuation signal and used to estimate the gradient of an objective function 
J that should be maximized or minimized. The objective function is gener- 
ally computed based on sensor measurements of the system, although it ulti- 
mately depends on the internal dynamics and the choice of the input signal. 
In extremum-seeking, the control variable u may refer either to the actuation 
signal or to a set of parameters that describe the control behavior, such as the 
frequency of periodic forcing or the gains in a PID controller. 

The extremum-seeking control architecture is shown in Fig. 10.17. This schematic
depicts ESC for a scalar input u, although the methods readily generalize for
vector-valued inputs u. A convex objective function J(u) is shown in Fig. 10.18
for static plant dynamics (i.e., for y = u). The extremum-seeking controller uses
an input perturbation to estimate the gradient of the objective function J and
steer the mean actuation signal towards the optimizing value.

Figure 10.17: Schematic illustrating an extremum-seeking controller. A
sinusoidal perturbation is added to the best guess of the input û, and it passes
through the plant, resulting in a sinusoidal output perturbation that may be
observed in the sensor signal y and the cost J. The high-pass filter results in
a zero-mean output perturbation, which is then multiplied (demodulated) by
the same input perturbation, resulting in the signal ξ. This demodulated signal
is finally integrated into the best guess û for the optimizing input u.

Three distinct timescales are relevant for extremum-seeking control:


(a) slow — external disturbances and parameter variation; 

(b) medium — perturbation frequency w; 

(c) fast — system dynamics. 
In many systems, the internal system dynamics evolve on a fast timescale. For 
example, turbulent fluctuations may equilibrate rapidly compared to actuation 
timescales. In optical systems, such as a fiber laser [129], the dynamics of light 
inside the fiber are extremely fast compared to the timescales of actuation. 


In extremum-seeking control, a sinusoidal perturbation is added to the estimate
of the input that maximizes the objective function, û:

$$u = \hat{u} + a\sin(\omega t). \tag{10.27}$$


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved.

Figure 10.18: Schematic illustrating extremum-seeking control on a static objective
function J(u). The output perturbation (red) is in phase when the input is left of
the peak value (i.e., u < u*) and out of phase when the input is to the right of
the peak (i.e., u > u*). Thus, integrating the product of input and output
sinusoids moves û towards u*.

This input perturbation passes through the system dynamics and output, resulting
in an objective function J that varies sinusoidally about some mean value, as
shown in Fig. 10.18. The output J is high-pass-filtered to remove the mean (DC
component), resulting in the oscillatory signal ρ. A simple high-pass filter is
represented in the frequency domain as


$$\frac{s}{s + \omega_h}, \tag{10.28}$$


where s is the Laplace variable and $\omega_h$ is the filter frequency. The
high-pass filter is chosen to pass the perturbation frequency ω. The
high-pass-filtered output is then multiplied by the input sinusoid, possibly with
a phase shift φ, resulting in the demodulated signal ξ:

$$\xi = a\sin(\omega t - \phi)\,\rho. \tag{10.29}$$


This signal ξ is mostly positive if the input u is to the left of the optimal value
u* and it is mostly negative if u is to the right of the optimal value u*, shown
as red curves in Fig. 10.18. Thus, the demodulated signal ξ is integrated into û,
the best estimate of the optimizing value,

$$\frac{d}{dt}\hat{u} = k\,\xi, \tag{10.30}$$

so that the system estimate û is steered towards the optimal input u*. Here, k
is an integral gain, which determines how aggressively the actuation climbs
gradients in J.

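Roughly speaking, the demodulated signal ξ measures gradients in the objective
function, so that the algorithm climbs to the optimum more rapidly when the
gradient is larger. This is simple to see for constant plant dynamics, where J is
simply a function of the input, J(u) = J(û + a sin(ωt)). Expanding J(u) in the
perturbation amplitude a, which is assumed to be small, yields

$$\begin{aligned}
J(u) &= J(\hat{u} + a\sin(\omega t)) && \text{(10.31a)} \\
     &= J(\hat{u}) + \left.\frac{\partial J}{\partial u}\right|_{u=\hat{u}} a\sin(\omega t) + \mathcal{O}(a^2). && \text{(10.31b)}
\end{aligned}$$

The leading-order term in the high-pass-filtered signal is
$\rho \approx \left.\partial J/\partial u\right|_{u=\hat{u}}\, a\sin(\omega t)$.
Averaging ξ = a sin(ωt − φ)ρ over one period yields

$$\begin{aligned}
\xi_{\text{avg}} &= \frac{\omega}{2\pi}\int_0^{2\pi/\omega} a\sin(\omega t - \phi)\,\rho\,dt && \text{(10.32a)} \\
&= \frac{\omega}{2\pi}\int_0^{2\pi/\omega} \left.\frac{\partial J}{\partial u}\right|_{u=\hat{u}} a^2 \sin(\omega t - \phi)\sin(\omega t)\,dt && \text{(10.32b)} \\
&= \frac{a^2}{2}\left.\frac{\partial J}{\partial u}\right|_{u=\hat{u}} \cos(\phi). && \text{(10.32c)}
\end{aligned}$$

Thus, for the case of trivial plant dynamics, the average signal $\xi_{\text{avg}}$
is proportional to the gradient of the objective function J with respect to the
input u.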
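
The averaging result in (10.32c) can be checked numerically. The following is a
minimal sketch in Python, assuming NumPy is available; it is not the book's code.
It uses the quadratic cost from (10.33) as a test objective and replaces the
high-pass filter with an idealized mean subtraction over one perturbation period.
The function name `averaged_demodulation` and the chosen values of û, a, and φ
are illustrative assumptions.

```python
import numpy as np

def averaged_demodulation(J, u_hat, a=0.2, omega=10.0, phi=0.0, n=2000):
    """Average xi = a*sin(omega*t - phi)*rho over one perturbation period."""
    # Uniform samples covering exactly one period of the perturbation
    t = np.linspace(0.0, 2.0 * np.pi / omega, n, endpoint=False)
    J_t = J(u_hat + a * np.sin(omega * t))
    # Idealized high-pass filter (assumption): subtract the mean over the period
    rho = J_t - J_t.mean()
    xi = a * np.sin(omega * t - phi) * rho
    return xi.mean()

# Test objective (10.33) and its exact derivative
J = lambda u: 25.0 - (5.0 - u) ** 2
dJdu = lambda u: 2.0 * (5.0 - u)

u_hat, a, phi = 2.0, 0.2, 0.3
xi_avg = averaged_demodulation(J, u_hat, a=a, phi=phi)
predicted = 0.5 * a**2 * dJdu(u_hat) * np.cos(phi)  # right-hand side of (10.32c)
print(xi_avg, predicted)
```

For this quadratic objective the two values agree to rounding error, because the
second-order term in ρ oscillates at 2ω and averages to zero against the
demodulating sinusoid. For a general J the agreement only holds to leading order
in a.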

In general, extremum-seeking control may be applied to systems with nonlinear
dynamics relating the input u to the outputs y that act on a faster timescale
than the perturbation ω. Thus, J may be time-varying, which complicates the
simplistic averaging analysis above. The general case of extremum-seeking control
of nonlinear systems is analyzed by Krstić and Wang [415], where they develop
powerful stability guarantees based on a separation of timescales and a singular
perturbation analysis. The basic algorithm may also be modified to add a phase φ
to the sinusoidal input perturbation in (10.29). In [415], there was an
additional low-pass filter $\omega_l/(s + \omega_l)$ placed before the integrator
to extract the DC component of the demodulated signal ξ. There is also an
extension to extremum-seeking called slope-seeking, where a specific slope is
sought instead of the standard zero slope corresponding to a maximum or minimum.
Slope-seeking is preferred when there is not an extremum, as in the case when
control inputs saturate. Extremum-seeking is often used for frequency selection
and slope-seeking is used for amplitude selection when tuning an open-loop
periodic forcing.

It is important to note that extremum-seeking control will only find local maxima
of the objective function, and there are no guarantees that this will correspond
to a global maximum. Thus, it is important to start with a good initial condition
for the optimization. In a number of studies, extremum-seeking control is used in
conjunction with other global optimization techniques, such as a genetic
algorithm, or sparse representation for classification [130, 256].

Figure 10.19: Extremum-seeking control response for cost function in (10.33).


Simple Example of Extremum-Seeking Control 


Here we consider a simple application of extremum-seeking control to find the 
maximum of a static quadratic cost function, 


$$J(u) = 25 - (5 - u)^2. \tag{10.33}$$


This function has a single global maximum at u* = 5. Starting at u = 0, we
apply extremum-seeking control with a perturbation frequency of ω = 10 Hz
and an amplitude of a = 0.2. Figure 10.19 shows the controller response and
the rapid tracking of the optimal value u* = 5. MATLAB and Python codes that
implement extremum-seeking using a simple Butterworth high-pass filter are
provided on the book's GitHub.
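
As a rough illustration of the loop in (10.27)-(10.30), the following Python
sketch simulates extremum-seeking on the cost (10.33). It is not the code from
the book's GitHub: instead of a Butterworth filter it uses the first-order
high-pass filter of (10.28), discretized with forward Euler, and the gain k, the
filter frequency, the time step, and the simulation length are assumed values
chosen for a clean separation of timescales. The function name `extremum_seek`
is hypothetical.

```python
import numpy as np

def extremum_seek(J, u0=0.0, a=0.2, omega=2 * np.pi * 10.0, k=10.0,
                  omega_h=2 * np.pi * 2.0, phi=0.0, dt=1e-3, T=30.0):
    """Simulate scalar ESC on a static objective J(u) with forward Euler."""
    n = int(round(T / dt))
    t = dt * np.arange(n)
    u_hat = np.empty(n)
    J_hist = np.empty(n)
    uh = u0
    z = J(u0)  # low-pass state, so that rho = J - z realizes s/(s + omega_h)
    for i in range(n):
        u = uh + a * np.sin(omega * t[i])          # perturbed input, (10.27)
        Ji = J(u)
        rho = Ji - z                               # high-pass-filtered output
        xi = a * np.sin(omega * t[i] - phi) * rho  # demodulation, (10.29)
        u_hat[i], J_hist[i] = uh, Ji
        uh += dt * k * xi                          # integrator, (10.30)
        z += dt * omega_h * (Ji - z)               # update the low-pass state
    return t, u_hat, J_hist

# Quadratic cost (10.33), starting from u = 0 with a 10 Hz perturbation
t, u_hat, J_hist = extremum_seek(lambda u: 25.0 - (5.0 - u) ** 2)
print(f"final estimate of the optimizing input: {u_hat[-1]:.3f}")
```

With these assumed parameters the averaged dynamics are approximately
dû/dt ≈ k a² (5 − û), so the estimate relaxes towards u* = 5 on a timescale of a
few seconds. A larger k speeds up the climb at the cost of a weaker separation
between the adaptation and perturbation timescales.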

Notice that when the gradient of the cost function is larger (i.e., closer to
u = 0), the oscillations in J are larger, and the controller climbs more rapidly.
When the input u gets close to the optimum value at u* = 5, even though the
input perturbation has the same amplitude a, the output perturbation is nearly
zero (on the order of a²), since the quadratic cost function is flat near the peak.
Thus we achieve fast tracking far away from the optimum value and small
deviations near the peak.

Figure 10.20: Extremum-seeking control response with a slowly changing cost
function J(u, t).

To see the ability of extremum-seeking control to handle varying system 
parameters, consider the time-dependent cost function given by 


$$J(u) = 25 - (5 - u - \sin(t))^2. \tag{10.34}$$


The varying parameters, which oscillate at 1/(2π) Hz, may be considered slow
compared with the perturbation frequency 10 Hz. The response of extremum-seeking
control for this slowly varying system is shown in Fig. 10.20. In this response,
the actuation signal is able to maintain good performance by oscillating back and
forth to approximately track the oscillating optimal u*, which oscillates between
4 and 6. The output function J remains close to the optimal value of 25, despite
the unknown varying parameter.

Figure 10.21: Schematic of a specific extremum-seeking control architecture that
benefits from a wealth of design techniques [26, 178].


Challenging Example of Extremum-Seeking Control 


Here we consider an example inspired by a challenging benchmark problem
in Section 1.3 of [26]. This system has a time-varying objective function J(t) and
dynamics with a right half-plane zero, making it difficult to control.

In one formulation of extremum-seeking [26, 178], there are additional guidelines
for designing the controller if the plant can be split into three blocks that
define the input dynamics, a time-varying objective function with no internal
dynamics, and the output dynamics, as shown in Fig. 10.21. In this case, there
are procedures to design the high-pass filter and integrator blocks.

In this example, the objective function is given by

$$J(\theta) = 0.05\,\delta(t - 10) + (\theta - \theta^*(t))^2,$$

where δ is the Dirac delta function, and the optimal value θ*(t) is given by

$$\theta^* = 0.01 + 0.001\,t.$$

The optimal objective is given by J* = 0.05δ(t − 10). The input and output
dynamics are taken from the example in [26], and are given by

$$F_{\text{in}}(s) = \frac{s - 1}{(s + 2)(s + 1)} \quad\text{and}\quad F_{\text{out}}(s) = \frac{1}{s + 1}.$$

Using the design procedure in [26], one arrives at the high-pass filter
s/(s + 5) and an integrator-like block given by 50(s − 4)/(s − 0.01). In addition,
a perturbation with ω = 5 and a = 0.05 is used, and the demodulating perturbation
is phase-shifted by φ = 0.7955; this phase is obtained by evaluating the input
function $F_{\text{in}}$ at iω. The response of this controller is shown in
Fig. 10.22, along with the Simulink implementation in Fig. 10.23. The controller
is able to accurately track the optimizing input, despite additive sensor noise.
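
The quoted phase shift can be reproduced with a few lines of Python. This is a
small sketch, taking the input dynamics to be F_in(s) = (s − 1)/((s + 2)(s + 1))
and assuming the sign convention φ = −∠F_in(iω), i.e., the demodulating sinusoid
is delayed by the phase lag of the input dynamics. The convention is an
assumption on our part; it is the one that matches the value stated in the text.

```python
import numpy as np

def F_in(s):
    """Input dynamics of the benchmark example, evaluated at complex s."""
    return (s - 1.0) / ((s + 2.0) * (s + 1.0))

omega = 5.0                          # perturbation frequency used in the example
phi = -np.angle(F_in(1j * omega))    # phase lag of F_in at i*omega (assumed convention)
print(f"phi = {phi:.4f}")            # approximately 0.7955
```

The right half-plane zero at s = 1 contributes a phase lag rather than a phase
lead, which is why compensating for it in the demodulation matters for this
plant.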

Figure 10.22: Extremum-seeking control response for a challenging test system
with a right half-plane zero, inspired by [26].

Figure 10.23: Simulink model for extremum-seeking controller used in Fig. 10.22.

Applications of Extremum-Seeking Control


Because of the lack of assumptions and ease of implementation, extremum- 
seeking control has been widely applied to a number of complex systems. Al- 
though ESC is generally applicable for in-time control of dynamical systems, it 
is also widely used as an online optimization algorithm that can adapt to slow 
changes and disturbances. Among the many uses of extremum-seeking control, 
here we highlight only a few. 

Extremum-seeking has been used widely for maximum power point tracking
algorithms in photovoltaics [439], and wind energy conversion [517]. In the case
of photovoltaics, the voltage or current ripple in power converters due to
pulse-width modulation is used for the perturbation signal; and in the case of
wind, turbulence is used as the perturbation. Atmospheric turbulent fluctuations
were also used as the perturbation signal for the optimization of aircraft
control; in this example it is infeasible to add a perturbation signal to the
aircraft control surfaces, and a natural perturbation is required. ESC has also
been used in optics and electronics for laser pulse shaping [599], for tuning
high-gain fiber lasers [130], and for beam control in a reconfigurable
holographic metamaterial antenna array [349]. Other applications include
formation flight optimization [87], bioreactors [741], PID and PI tuning, active
braking systems [771], and control of Tokamaks [541].

Extremum-seeking has also been broadly applied in turbulent flow control. 
Despite the ability to control dynamics in-time with ESC, it is often used as 
a slow feedback optimization to tune the parameters of a working open-loop 
controller. This slow feedback has many benefits, such as maintaining perfor- 
mance despite slow changes to environmental conditions. Extremum-seeking 
has been used to control an axial flow compressor [742], to reduce drag over 
a bluff body in an experiment using a rotating cylinder on the upper 
trailing edge of the rear surface, and for separation control in a high-lift airfoil 
configuration using pressure sensors and pulsed jets on the leading edge 
of a single-slotted flap. There have also been impressive industrial-scale uses of 
extremum-seeking control, for example to control thermoacoustic modes across 
a range of frequencies in a 4 MW gas turbine combustor [48, 50]. It has also been
utilized for separation control in a planar diffusor that is fully turbulent and 
stalled [49], and to control jet noise [493]. 

There are numerous extensions to extremum-seeking that improve performance.
For example, extended Kalman filters were used as the filters to control
thermoacoustic instabilities in a combustor experiment, reducing pressure
fluctuations by nearly 40 dB. Kalman filters were also used with ESC to
reduce the flow separation and increase the pressure ratio in a high-pressure
axial fan using an injected pulsed air stream [752]. Including the Kalman filter
improved the controller bandwidth by a factor of 10 over traditional ESC.

Homework


Exercise 10-1. In this exercise, we will compare DMDc, SINDYc, and a neural 
network (NN) for use with MPC to control the Lorenz system, following Kaiser 
et al. [366]. 


First, generate training data by simulating the forced Lorenz system:

$$\begin{aligned}
\dot{x} &= 10(y - x) + u, \\
\dot{y} &= x(28 - z) - y, \\
\dot{z} &= xy - (8/3)z.
\end{aligned}$$

Generate data using a small time-step Δt = 0.001 from t = 0 to t = 10 with a
rich control input u(t) = (2 sin(t) + sin(0.1t))². Use this data to train a DMDc,
SINDYc, and NN model. For the NN model, start with a single hidden layer
with 10 neurons using hyperbolic tangent sigmoid activation functions. Compare
the performance of all of these models to predict the response to a new
forcing u(t) = (5 sin(30t))³ for t = 10 to t = 20.

Finally, design an MPC controller based on each of these models to stabilize the
fixed point (x, y, z) = (−√72, −√72, 27).

(Bonus) Compare the prediction and control performance of the various mod- 
els as a function of the amount of data used in the training phase. 


Exercise 10-2. (Advanced) This exercise will develop a model predictive con- 
troller for the fluid flow past a cylinder. There are several open-source codes 
that can be used to simulate simple fluid flows, such as the IBPM code at 
https://github.com/cwrowley/ibpm/ 


(a) First, generate a training data set that simulates the vortex shedding be- 
hind a stationary cylinder at a Reynolds number of 100. Compute the 
mean flow field during the periodic vortex shedding portion, after initial 
transients have died out. In addition, generate training data correspond- 
ing to the cylinder rotating, which will be our control input. Since the 
absolute angle of the cylinder is irrelevant due to symmetry, we will con- 
sider the angular rate of the cylinder as the control input. For all data, 
subtract the mean flow computed above. 


(b) Train a DMDc model based on this data, and test the performance of this 
model on a test data set of the cylinder rotating. Plot the various responses 
and discuss the performance. 


(c) Use this DMDc model for MPC with the goal of stabilizing a symmetric
configuration, where the DMDc state is equal to zero. Plot the performance
and discuss the results.

(d) Now, instead of using the full flow field to characterize a DMDc model,
use the lift and drag coefficients, $C_L$ and $C_D$; you may need to build
an augmented state vector, using either these coefficients and their time
derivatives or time-delayed values. Similarly, test the performance of this
model on the test data and discuss.


(e) Use this force-based DMDc model to develop an MPC that tracks a given
reference lift value, say $C_L = 1$ or $C_L = -1$. See if you can make your
controller track a reference that switches between these values. What if
the reference lift is much larger, say $C_L = 2$ or $C_L = 5$?


(f) Use this model to develop an MPC that tracks a reference drag value, such
as $C_D = 0$. Is it possible to simultaneously track a reference lift and drag
value? Why or why not?


Exercise 10-3. (Advanced) Repeat the exercise above, but instead of using a 
DMDc model, construct a neural network model for a deep MPC. This exercise 
follows the work of Morton et al. [513] and Bieker et al. [84]. 


Compare the results from the NN-based MPC and the DMDc-based MPC. 


Exercise 10-4. Design an optimal full-state LQR controller using a genetic
algorithm for the spring-mass-damper system:

$$\ddot{x} + 5\dot{x} + 4x = u(t).$$

Plot the closed-loop eigenvalues and the LQR cost J as a function of the
generation. Note that you will need to specify the gain matrices Q and R to define
the cost function J.


Exercise 10-5. Design an extremum-seeking controller to find the peak of a
quartic cost function J(u) = 16 − (2 − u)⁴, similar to the example with quadratic
cost in (10.33). Do you need to modify the extremum-seeking control parameters?
Discuss the performance.

Chapter 11


Reinforcement Learning 


Reinforcement learning (RL) is a major branch of machine learning that is con- 
cerned with how to learn control laws and policies to interact with a complex 
environment from experience [683]. Thus, RL is situated at the growing 
intersection of control theory and machine learning [589], and it is among the 
most promising fields of research towards generalized artificial intelligence and 
autonomy. Both machine learning and control theory fundamentally rely on op- 
timization, and, likewise, RL involves a set of optimization techniques within 
an experiential framework for learning how to interact with the environment. 

In reinforcement learning, an agent¹ senses the state of its environment and
learns to take appropriate actions to optimize future rewards. The ultimate goal 
in RL is to learn an effective control strategy or set of actions through posi- 
tive or negative reinforcement. This search may involve trial-and-error learn- 
ing, model-based optimization, or a combination of both. In this way, reinforce- 
ment learning is fundamentally biologically inspired, mimicking how animals 
learn to interact with their environment through positive and negative reward 
feedback from trial-and-error experience. Much of the history of reinforcement 
learning, and machine learning more broadly, has been linked to studies of 
animal behavior and the neurological basis of decisions, control, and learning 
[645]. For example, Pavlov’s dog is an illustration that animals 
learn to associate environmental cues with a food reward [551]. The term re- 
inforcement refers to the rewards, such as food, used to reinforce desirable ac- 
tions in humans and animals. However, in animal systems, reinforcement is 
ultimately achieved through cellular and molecular learning rules. 

Multiple textbooks have been written on this topic, which spans almost a 
century of progress. Major advances in deep reinforcement learning are also 
rapidly changing the landscape. This chapter is not meant to be comprehensive; 
rather, it aims to provide a solid foundation, to introduce key concepts and 
leading approaches, and to lower the barrier to entry in this exciting field. 


¹Ironically, from the perspective of reinforcement learning, in The Matrix, Neo is
actually the agent learning to interact with his environment.

Figure 11.1: Schematic of reinforcement learning, where an agent senses its
environmental state s and takes actions a according to a policy π that is optimized
through learning to maximize future rewards r. In this case, a deep neural
network is used to represent the policy π. This is known as a deep policy network.


11.1 Overview and Mathematical Formulation 


Figure 11.1 provides a schematic overview of the reinforcement learning
framework. An RL agent senses the state of its environment and learns to take
appropriate actions to achieve optimal immediate or delayed rewards. Specifically,
the RL agent arrives at a sequence of different states $s_k \in S$ by performing
actions $a_k \in A$, with the selected actions leading to positive or negative
rewards $r_k$ used for learning. The sets S and A denote the sets of possible states
and actions, respectively. Importantly, the RL agent is capable of learning from
delayed rewards, which is critical for systems where the optimal solution
involves a multi-step procedure. Rewards may be thought of as sporadic and
time-delayed labels, leading to RL being considered a third major branch of
machine learning, called semi-supervised learning, which complements the other
two branches of supervised and unsupervised learning. One canonical example
is learning a set of moves, or a long-term strategy, to win a game of chess.
As is the case with human learning, RL often begins with an unstructured
exploration, where trial and error are used to learn the rules, followed by
exploitation, where a strategy is chosen and optimized within the learned rules.

The Policy


An RL agent senses the state of its environment s and takes actions a through
a policy π that is optimized through learning to maximize future rewards r.
Reinforcement learning is often formulated as an optimization problem to learn
the policy π(s, a),

$$\pi(s, a) = \Pr(\mathbf{a} = a \mid \mathbf{s} = s), \tag{11.1}$$

which is the probability of taking action a, given state s, to maximize the total
future rewards. In the simplest formulation, the policy may be a look-up table
that is defined on the discrete state and action spaces S and A, respectively.
However, for most problems, representing and learning this policy becomes
prohibitively expensive, and π must instead be represented as an approximate
function that is parameterized by a lower-dimensional vector θ:

$$\pi(s, a) \approx \pi(s, a, \theta). \tag{11.2}$$

Often, this parameterized function will be denoted $\pi_\theta(s, a)$. Function
approximation is the basis of deep reinforcement learning in Section 11.4, where
it is possible to represent these complex functions using deep neural networks.

Note that, in the literature, there is often an abuse of notation, where π(s, a)
is used to denote the action taken, rather than the probability of taking an action
a given a state observation s. In the case of a deterministic policy, such as a
greedy policy, then it may be possible to use a = π(s) to represent the action
taken. We will attempt to be clear throughout when choosing one convention
over another.
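
To make the distinction between a look-up table and a parameterized policy
concrete, the following Python sketch builds both for a toy problem. Everything
here is hypothetical and chosen only for illustration: the number of states and
actions, the two-dimensional feature map `features`, and the softmax form of the
parameterized policy, which is one common choice and not one prescribed by the
text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Tabular policy: pi_table[s, a] = Pr(a | s); each row is a probability vector
pi_table = rng.random((n_states, n_actions))
pi_table /= pi_table.sum(axis=1, keepdims=True)

def sample_action(pi_row, rng):
    """Draw an action index from the probabilities for one state."""
    return int(rng.choice(len(pi_row), p=pi_row))

def greedy_action(pi_row):
    """Deterministic convention a = pi(s): pick the most probable action."""
    return int(np.argmax(pi_row))

def features(s):
    """Hypothetical feature map for an integer-valued state."""
    return np.array([1.0, float(s)])

def softmax_policy(s, theta):
    """Parameterized policy pi_theta(s, .): softmax over linear scores."""
    scores = theta @ features(s)   # theta has shape (n_actions, 2)
    scores = scores - scores.max() # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

theta = rng.normal(size=(n_actions, 2))
s = 2
print(pi_table[s], sample_action(pi_table[s], rng), greedy_action(pi_table[s]))
print(softmax_policy(s, theta))
```

In this toy case the table stores n_states × n_actions probabilities, while the
parameterized policy stores a number of parameters that does not grow with the
number of states, which is the point of (11.2).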


The Environment: a Markov Decision Process (MDP) 


In general, the measured state of the system may be a partial measurement of a 
higher-dimensional environmental state that evolves according to a stochastic, 
nonlinear dynamical system. For simplicity, most introductions to RL assume 
that the full state is measured and that it evolves according to a Markov de- 
cision process (MDP), so that the probability of the system occurring in the 
current state is determined only by the previous state. We will begin with this 
simple formulation. Even when it is assumed that the state evolves according to 
an MDP, it is often the case that this model is not known, motivating the use of 
“model-free” RL strategies discussed in Section 11.3. Similarly, when a model is
not known, it may be possible to first learn an MDP using data-driven methods
and then use this for “model-based” reinforcement learning, as in Section 11.2.

An MDP consists of a set of states S, a set of actions A, and a set of rewards
R, along with the probability of transitioning from state $s_k$ at time $t_k$ to
state $s_{k+1}$ at time $t_{k+1}$ given action $a_k$,

$$P(s', s, a) = \Pr(\mathbf{s}_{k+1} = s' \mid \mathbf{s}_k = s, \mathbf{a}_k = a), \tag{11.3}$$

and a reward function R,

$$R(s', s, a) = \Pr(\mathbf{r}_{k+1} \mid \mathbf{s}_{k+1} = s', \mathbf{s}_k = s, \mathbf{a}_k = a). \tag{11.4}$$

Sometimes the transition probability P(s′, s, a) will be written as P(s′ | s, a).
Again, sometimes there will be an abuse of notation, where a chosen policy π
will be used instead of the action a in the argument of either P or R above. In
this case, it is assumed that this implies a sum over actions, as in

$$P(s', s, \pi) = \sum_{a \in A} \pi(s, a)\, P(s', s, a). \tag{11.5}$$

Thus, an MDP generalizes the notion of a Markov process to include actions
and rewards, making it suitable for decision making and control. A simple
Markov process is a set of states S and a probability of transitioning from
one state to the next. The defining property of a Markov process and an MDP is
that the probability of being in a future state is entirely determined by the
current state, and not by previous states or hidden variables. The MDP framework
is closely related to transition state theory and the Perron-Frobenius operator,
which is the adjoint of the Koopman operator from Section 7.4.

In the case of a simple Markov process with a finite set of states S, then it
is possible to let $s \in \mathbb{R}^n$ be a vector of the probability of being in
each of the n states, in which case the Markov process P(s′, s) may be written in
terms of a transition matrix, also known as a stochastic matrix, or a probability
matrix, T:

$$s' = T s, \tag{11.6}$$

where each column of T must add up to 1, which is a statement of conservation
of probability that, given a particular state s, something must happen after the
transition to s′. Similarly, for an MDP, given a policy π, the transition process
may be written as

$$s' = \sum_{a \in A} \pi(s, a)\, T_a\, s. \tag{11.7}$$

Now, for each action a, $T_a$ is a Markov process with all columns summing to 1.
One of the defining properties of a Markov process is that the system
asymptotically approaches a steady state μ, which is the eigenvector of T
corresponding to eigenvalue 1. Similarly, given a policy π, an MDP asymptotically
approaches a steady state $\mu_\pi = \sum_a \pi(s, a)\, \mu_a$.

This brings up another notational issue, where, for continuous processes,
$s \in \mathbb{R}^n$ describes the continuous state vector in an n-dimensional
vector space, as in Chapters 7 and 8, while, for discrete state spaces,
$s \in \mathbb{R}^n$ denotes a vector of probabilities of belonging to one of n
finite states. It is important to carefully consider which notation is being used
for a given problem, as these formulations have different dynamics (i.e.,
differential equation versus MDP) and interpretations (i.e., deterministic
dynamics versus probabilistic transitions).
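
The transition-matrix picture in (11.6) is easy to explore numerically. The
sketch below, in Python with NumPy, uses a made-up three-state column-stochastic
matrix T (an assumption for illustration only) and compares repeated application
of s′ = Ts with the eigenvector of T for eigenvalue 1. It also forms a
policy-averaged matrix in the spirit of (11.7), under the simplifying assumption
that the action probabilities do not depend on the state.

```python
import numpy as np

# Hypothetical 3-state transition matrix; column j holds Pr(next state | state j)
T = np.array([[0.90, 0.2, 0.1],
              [0.05, 0.7, 0.3],
              [0.05, 0.1, 0.6]])
assert np.allclose(T.sum(axis=0), 1.0)  # conservation of probability

# Iterate s' = T s starting with all probability in the first state
s = np.array([1.0, 0.0, 0.0])
for _ in range(500):
    s = T @ s

# Steady state: eigenvector of T with eigenvalue 1, normalized to sum to 1
evals, evecs = np.linalg.eig(T)
i = np.argmin(np.abs(evals - 1.0))
mu = np.real(evecs[:, i])
mu = mu / mu.sum()
print(s, mu)

# Policy-averaged transition matrix for two actions (state-independent policy)
T_cycle = np.eye(3)[:, [1, 2, 0]]   # second hypothetical action: a cyclic shift
T_pi = 0.3 * T + 0.7 * T_cycle
```

Because every entry of this particular T is positive, the iteration converges to
the same μ from any initial probability vector. The mixture T_pi is again
column-stochastic, so it defines a Markov process of its own.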
The Value Function


Given a policy π, we next define a value function that quantifies the desirability
of being in a given state:

$$V_\pi(s) = E\left(\sum_{k=0}^{\infty} \gamma^k r_k \,\middle|\, s_0 = s\right), \tag{11.8}$$

where E is the expected future reward, given a policy π and subject to a discount
rate γ. Future rewards are discounted, reflecting the economic principle that
current rewards are more valuable than future rewards. Often, the subscript π
is omitted from the value function, in which case we refer to the value function
for the best possible policy:

$$V(s) = \max_\pi E\left(\sum_{k=0}^{\infty} \gamma^k r_k \,\middle|\, s_0 = s\right). \tag{11.9}$$

One of the most important properties of the value function is that the value
at a state s may be written recursively as

$$V(s) = \max_\pi E\left(r_0 + \sum_{k=1}^{\infty} \gamma^k r_k \,\middle|\, s_1 = s'\right), \tag{11.10}$$

which implies that

$$V(s) = \max_\pi E\left(r_0 + \gamma V(s')\right), \tag{11.11}$$

where $s' = s_{k+1}$ is the next state after $s = s_k$ given action $a_k$, and the
expectation is over actions selected from the optimal policy π. This expression,
known as Bellman's equation, is a statement of Bellman's principle of optimality,
and it is a central result that underpins modern RL.

Given the value function, it is possible to extract the optimal policy as

$$\pi = \underset{\pi}{\operatorname{argmax}}\; E\left(r_0 + \gamma V(s')\right). \tag{11.12}$$
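
Bellman's equation (11.11) and the policy extraction in (11.12) can be
illustrated by iterating the recursion to a fixed point on a small finite MDP
(often called value iteration). The Python sketch below is a minimal example
under stated assumptions: the three-state, two-action MDP is invented for
illustration, the array R holds the expected reward for each transition rather
than a reward distribution, and the helper names `q_values` and
`value_iteration` are hypothetical.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s2] = Pr(s_{k+1} = s2 | s_k = s, a_k = a)
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],   # action 0
    [[0.1, 0.9, 0.0],
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],   # action 1
])
R = np.zeros_like(P)
R[:, :, 2] = 1.0         # assumed reward of 1 for every transition into state 2
gamma = 0.9              # discount rate

def q_values(V, P, R, gamma):
    """Q[a, s] = sum over s2 of P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])."""
    return np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V(s) <- max_a E(r + gamma V(s')) until the update is below tol."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        V_new = q_values(V, P, R, gamma).max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy (deterministic) policy with respect to the converged value function
    policy = q_values(V, P, R, gamma).argmax(axis=0)
    return V, policy

V, policy = value_iteration(P, R, gamma)
print(V, policy)
```

In this toy problem, state 2 can be held forever under action 1, collecting a
reward of 1 per step, so its value is 1/(1 − γ) = 10. The other states inherit
discounted versions of that value through the recursion.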


Goals and Challenges of Reinforcement Learning 


Learning the policy π, the value function V, or jointly learning both, is the
central challenge in RL. Depending on the assumed structure of π, the size of S and
evolution dynamics P, and the reward landscape R, determining an optimal
policy may range from a closed-form optimization to a rather high-dimensional
unstructured optimization. Thus, a large number of trials must often be
evaluated in order to determine an optimal policy. In practice, reinforcement
learning may be very expensive to train, and it might not be the right strategy
for problems where testing a policy is expensive or potentially unsafe. Similarly,
there are often simpler control strategies than RL, such as LQR or MPC; when these
approaches are effective, they are often preferable. Reinforcement learning is, 
therefore, well suited for situations where some combination of the following 
are true: evaluating a policy is inexpensive, as in board games; there are suffi- 
cient resources to perform a near-brute-force optimization, as in evolutionary 
optimization; and/or no other control strategy works. 

Although RL is typically formulated within the mathematical framework 
of MDPs, many real-world applications do not satisfy these assumptions. For 
a partially observed MDP (POMDP), the dynamics depend on the state history 
or on hidden variables. Similarly, the evolution dynamics may be entirely de- 
terministic, yet chaotic. However, as we will see, it is often possible to develop 
approximate probabilistic transition state models for chaotic dynamics or to 
augment the environment state to include past states for systems with mem- 
ory or hidden variables. Often, the underlying MDP transition probability and 
reward functions are not known a priori, and either must be learned ahead of 
time through some exploration phase, or alternative model-free optimization 
techniques must be used. Finally, many of the theoretical convergence results, 
and indeed many of the fundamental RL algorithms, only apply to finite MDPs, 
which are characterized by finite sets of actions A and states S. Games, such as 
chess, fall into this category, even though the number of states may be combi- 
natorially large. Continuous dynamical systems, such as a pendulum on a cart, 
may be approximated by a finite MDP through a discretization or quantization 
process. 

There is typically much less supervisory information available to an RL 
agent than is available in classical supervised learning. One of the central chal- 
lenges of reinforcement learning is that rewards are often extremely rare and 
may be significantly delayed from a sequence of good control actions. This chal- 
lenge leads to the so-called credit assignment problem, coined by Minsky 
to describe the challenge of knowing what action sequence was responsible for 
the reward ultimately received. These sparse and delayed rewards have been 
a central challenge in RL for six decades, and they are still a focus of research 
today. The resulting optimization problem is computationally expensive and 
data-intensive, requiring considerable trial and error. 

Today, reinforcement learning is being used to learn sophisticated control
policies for complex open-world problems in autonomy and propulsion (e.g.,
self-driving cars, learning to swim and fly, etc.) and as a general learning
environment for rule-constrained games (e.g., checkers, backgammon, chess, go,
Atari video games, etc.). Much of the history of RL may be traced through
the success on increasingly challenging board games, from checkers to
backgammon and more recently to chess and go [659]. These games serve
to illustrate many of the central challenges that are still faced in RL, including
the curse of dimensionality and the credit assignment problem.

Figure 11.2: Reinforcement learning is inspired by biological learning with
sparse rewards. Mordecai was trained to balance a treat on his nose until a
command is given, after which he catches it in the air. Credit: Bing Brunton for
image and training.


Motivating Examples 


It is helpful to understand RL through simple examples. Consider a mouse in 
a maze. The mouse is the agent, and the environment is the maze. The mouse 
measures the local state in its environment; it does not have access to a full top- 
down view of the maze, but instead it knows its current local environment and 
what past actions it has taken. The mouse has agency to take some action about 
what to do next, for example, whether to turn left, turn right, or go forward. 
Typically, the mouse does not receive a reward until the end of the maze. If 
the mouse received a reward after each correct turn, it would have a much 
simpler supervised learning task. Setting such a curriculum is a strategy to help 
teach animals, whereby initially dense rewards are sparsified throughout the 
learning process. 

More generally, RL may be used to understand animal behavior, ranging 
from semi-supervised training to naturalistic behaviors. Figure 11.2 shows a
trained behavior where a treat is balanced on Mordecai’s nose until a com- 
mand is given, after which he is able to catch it out of the air. Often, training 
animals to perform complex tasks involves expert human guidance to provide 
intermediate rewards or secondary reinforcers, such as using a clicker to in- 
dicate a future reward. In animal training and in RL, the more proximal the 
reward is in time to the action, the easier it is to learn the task. The connection 
between learning and temporal proximity is the basis of temporal difference (TD) 
learning, which is a powerful concept in RL, and this is also important to our 
understanding of the chemical basis for addiction [593]. 

It is also helpful to consider two-player games, such as tic-tac-toe, check- 
ers, backgammon, chess, and go. In these games, the agent is one of the play- 
ers, and the environment encompasses the rules of the game along with an 
adversarial opponent. These examples are also interesting because there is an 
element of randomness or stochasticity in the environment, either because of
the fundamental rules (e.g., a dice roll in backgammon) or because of an
opponent’s probabilistic strategy. Thus, it may be advantageous for the agent to also
adopt a probabilistic policy, in contrast to much of the theory of classical control 
for deterministic systems. Similarly, a probabilistic strategy may be important 
when learning how to play. 

In most games, the reward signal comes at the end of the game after the 
agent has won or lost. Again, this makes the learning process exceedingly
challenging, as it is initially unclear which sub-sequence of actions was
particularly important in driving the outcome. For example, an agent may play an
excellent chess opening and mid-game and then lose at the end because of a 
few bad moves. Should the agent discard the entire first half of the game, or, 
worse yet, attribute this to a negative reward? Thus, it is clear that a major part 
of learning an effective policy is understanding the value of being in a given 
state s. In a game like chess, where the number of states is combinatorially 
large, there are too many states to count, and it is intractable to map out the ex- 
act value of all board states. Instead, players create simple heuristic rules about 
what are good board positions, e.g., assigning points to the various pieces to 
keep track of a rough score. This intermediate score provides a denser reward 
structure throughout the game. However, these heuristics are sub-optimal and 
may be susceptible to gambits, where the opponent sacrifices a piece for an im- 
mediate point loss in order to eventually move to a more favorable global state 
s. In backgammon, an intermediate point total may be more explicitly com- 
puted as the total number of pips, or points that a player must roll to move all 
pieces home and off the board. Although this makes it relatively simple to es- 
timate the strength of a board position, the discrete nature of the die roll and 
game mechanics makes this a sub-optimal approximation, and the number of 
required dice rolls or turns may be a more useful measure. 

Thinking through games like these illustrates many of the modern strategies 
to improve the learning rates and sample efficiency of RL, including hindsight 
replay, temporal difference learning, look ahead, and reward shaping, which 
we will discuss in the following sections. For example, playing against a skilled 
teacher can dramatically improve the learning rate, as the teacher provides 
guidance about whether or not a move is good, and why, adding information to 
help shape proxy metrics that can be used as intermediate rewards and models 
that can accelerate the learning process. 


Categorization of RL Techniques 


Nearly all problems in machine learning and control theory involve challeng- 
ing optimization problems. In the case of machine learning, the parameters of 
a model are optimized to best fit the training data, as measured by a loss
function. In the case of control, a set of control performance metrics are optimized
subject to the constraints of the dynamics. Reinforcement learning is no
different, as it is at the intersection of machine learning and control theory.

Figure 11.3: Rough categorization of reinforcement learning techniques. This
organization is not comprehensive, and some of the lines are becoming blurred.
The first major dichotomy is between model-based and model-free RL
techniques. Next, within model-free RL, there is a dichotomy between
gradient-based and gradient-free methods. Finally, within gradient-free methods, there
is a dichotomy between on-policy and off-policy methods.

There are many approaches to learn an optimal policy $\pi$, which is the
ultimate goal of RL. A major dichotomy in reinforcement learning is that of
model-based RL versus model-free RL. When there is a known model for the
environment, there are several strategies for learning either the optimal policy or value
function through what is known as policy iteration or value iteration, which are
forms of dynamic programming using the Bellman equation. When there is
no model for the environment, alternative strategies, such as Q-learning, must
be employed. The reinforcement learning optimization problem may be
particularly challenging for high-dimensional systems with unknown, nonlinear,
stochastic dynamics, and sparse and delayed rewards. All of these techniques
may be combined with function approximation techniques, such as neural
networks, for approximating the policy $\pi$, the value function $V$, or the quality
function $Q$ (discussed in subsequent sections), making them more useful for
high-dimensional systems. These model-based, model-free, and deep learning
approaches will be discussed below. Figure 11.3 summarizes the main
organization of these RL techniques.

Note that this section only provides a glimpse of the many optimization 
approaches used to solve RL problems, as this is a vast and rapidly growing 
field. 


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 


11.2 Model-Based Optimization and Control 


This section provides a high-level overview of some essential model-based op- 
timization and control techniques. Some people do not consider these tech- 
niques to be reinforcement learning, as they do not involve learning an opti- 
mal strategy through trial-and-error experience. However, the techniques are 
closely related. It is possible to learn a model through trial and error, and then 
use this model with these techniques, which would be considered RL. 

For the simplified case of a known model that is a finite MDP, it is possible 
to learn either the optimal policy or value function through what is known as 
policy iteration or value iteration, which are forms of dynamic programming us- 
ing the Bellman equation. Dynamic programming 
is a powerful approach that is used for general optimal nonlinear control and 
reinforcement learning, among other tasks. These algorithms provide a math- 
ematically simplified optimization framework that helps to introduce essential 
concepts used throughout. 

More generally, dynamic programming and RL optimization are related to 
the field of optimal nonlinear control, which has deep roots in variational
theory going back to Bernoulli and the brachistochrone problem more than three
centuries ago. We will explore this connection to nonlinear control theory in
Section 11.6.


Dynamic Programming 


Dynamic programming is a mathematical framework introduced by Richard 
E. Bellman to solve large multi-step optimization problems, such as 
those found in decision making and control. Policy iteration and value itera- 
tion, discussed below, are two examples of the use of dynamic programming 
in reinforcement learning. To solve these multi-step optimizations, dynamic 
programming reformulates the large optimization problem as a recursive opti- 
mization in terms of smaller sub-problems, so that only a local decision need 
be optimized. This approach relies on Bellman’s principle of optimality, which 
states that a large multi-step control policy must also be locally optimal in every 
sub-sequence of steps. 

The Bellman equation indicates that the large optimization problem
over an entire state-action trajectory $(s_k, a_k)$ may be broken into a recursive
optimization at each point along the trajectory. As long as the value function is
known at the next point $s'$, it is possible to solve the optimization at point $s$
simply by optimizing the policy $\pi(s, a)$ at this point. Of course, this assumes
that the value function is known at all possible next states $s' = s_{k+1}$, which is
a function of the current state $s_k$, the current action $a_k$, and the dynamics
governing the system. This becomes even more complex for non-MDP dynamics,
such as the nonlinear control formulation in Section 11.6. For even moderately
large problems, this approach suffers from the curse of dimensionality, and
approximate solution methods must be employed.

When tractable, dynamic programming (i.e., the process of breaking a large 
problem into smaller overlapping sub-problems) provides a globally optimal 
solution. There are two main approaches to dynamic programming, referred to 
as top down and bottom up. 


(a) Top down: The top-down approach involves maintaining a table of sub- 
problems that are referred to when solving larger problems. For a new 
problem, the table is checked to see if the relevant sub-problem has been 
solved. If so, it is used, and, if not, the sub-problem is solved. This tabular 
storage is called memoization and becomes combinatorially complex for 
many problems. 


(b) Bottom up: The bottom-up approach involves starting by solving the 
smallest sub-problems first, and then combining these to form the larger 
problems. This may be thought of as working backwards from every pos- 
sible goal state, finding the best previous action to get there, then going 
back two steps, then going back three steps, etc. 


Although dynamic programming represents a brute-force search through
all sub-problems, it is still more efficient than a naive brute-force search,
because each overlapping sub-problem is solved only once. In
some cases, it reduces the computational complexity to an algorithm that scales 
linearly with the number of sub-problems, although this may still be combina- 
torially large, as in the example of the game of chess. Dynamic programming 
is closely related to divide-and-conquer techniques, such as quick sort, except 
that divide and conquer applies to non-overlapping or non-recursive (i.e., inde- 
pendent) sub-problems, while dynamic programming applies to overlapping 
or recursively interdependent sub-problems. 
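
The two approaches above can be illustrated on a toy problem. The following is a minimal sketch, assuming a hypothetical 3x3 grid of step costs (`COST`) in which the agent may only move right or down; the function names are illustrative and not from the text. The top-down version memoizes sub-problems as they are requested, while the bottom-up version fills a table backwards from the goal.

```python
from functools import lru_cache

# Hypothetical step costs on a 3x3 grid; moves are restricted to right or down.
COST = [
    [1, 3, 1],
    [1, 5, 1],
    [4, 2, 1],
]
ROWS, COLS = len(COST), len(COST[0])


@lru_cache(maxsize=None)
def min_cost_top_down(i=0, j=0):
    """Top down: recurse from the start cell and memoize each sub-problem."""
    if i == ROWS - 1 and j == COLS - 1:
        return COST[i][j]
    best = float("inf")
    if i + 1 < ROWS:
        best = min(best, min_cost_top_down(i + 1, j))
    if j + 1 < COLS:
        best = min(best, min_cost_top_down(i, j + 1))
    return COST[i][j] + best


def min_cost_bottom_up():
    """Bottom up: fill a table of costs-to-go backwards from the goal cell."""
    table = [[0] * COLS for _ in range(ROWS)]
    for i in reversed(range(ROWS)):
        for j in reversed(range(COLS)):
            options = []
            if i + 1 < ROWS:
                options.append(table[i + 1][j])
            if j + 1 < COLS:
                options.append(table[i][j + 1])
            table[i][j] = COST[i][j] + (min(options) if options else 0)
    return table[0][0]


print(min_cost_top_down(), min_cost_bottom_up())
```

Both functions visit each of the nine cells once, whereas a naive recursion without the cache would re-solve the shared sub-problems along every path.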

The recursive structure of dynamic programming suggests approximate so- 
lution techniques, such as the alternating directions method, where a sub-optimal 
solution is initialized and the value function is iterated over. This will be dis- 
cussed next. 


Policy Iteration 


Policy iteration is a two-step optimization procedure to simultaneously find an
optimal value function $V_\pi$ and the corresponding optimal policy $\pi$.

First, a candidate policy $\pi$ is evaluated, resulting in the value function for
this fixed policy. This typically involves a brute-force calculation of the value
function for this policy starting at many or all initial states. The policy may need
to be simulated for a long time depending on the reward delay and discounting
factor $\gamma$.

Next, the value function is fixed, and the policy is optimized to improve the
expected rewards by taking different actions at a given state. This optimization
relies on the recursive formulation of the value function due to Bellman’s
equation (11.11):

\begin{align}
V_\pi(s) &= \mathbb{E}\big(R(s', s, \pi(s)) + \gamma V_\pi(s')\big) \tag{11.13a} \\
&= \sum_{s'} P(s' \mid s, \pi(s)) \big(R(s', s, \pi(s)) + \gamma V_\pi(s')\big). \tag{11.13b}
\end{align}


Note that, in this expression, we have assumed a deterministic policy $a = \pi(s)$;
otherwise the expression would involve a second summation over $a \in \mathcal{A}$, with the
summand multiplied by $\pi(s, a)$.

It is then possible to fix $V_\pi(s')$ and optimize over the policy in the first term.
In particular, the new deterministic optimal policy at the state $s$ is given by

$$
\pi(s) = \operatorname*{argmax}_{a \in \mathcal{A}} \mathbb{E}\big(R(s', s, a) + \gamma V_\pi(s')\big). \tag{11.14}
$$


Once the policy is updated, the process repeats, fixing this policy to up- 
date the value function, and then using this updated value function to improve 
the policy. The process is repeated until both the policy and the value function 
converge to within a specified tolerance. It is important to note that this proce- 
dure is both expensive and prone to finding local minima. It also resembles the 
alternating descent method that is widely used in optimization and machine 
learning. 

The formulation above makes it clear that it may be possible to optimize
backwards from a state known to give a reward with high probability. Addi- 
tionally, this approach requires having a model for P and R to predict the next 
state s’, making this a model-based approach. 
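
The alternation between policy evaluation and policy improvement can be sketched on a small example. The code below is a minimal illustration, assuming a hypothetical four-state chain MDP (the `transitions` function, the 0.9 success probability, and the reward of 1 for reaching the last state are all assumptions made for this example). Evaluation iterates (11.13b) for a fixed deterministic policy, and improvement applies (11.14).

```python
GAMMA = 0.9
N_STATES, ACTIONS = 4, (0, 1)  # actions: 0 = left, 1 = right; state 3 is terminal


def transitions(s, a):
    """Return [(probability, next_state, reward)] for a hypothetical chain MDP."""
    if s == N_STATES - 1:
        return [(1.0, s, 0.0)]  # absorbing goal state
    target = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if target == N_STATES - 1 else 0.0
    # The intended move succeeds with probability 0.9; otherwise the agent stays put.
    return [(0.9, target, reward), (0.1, s, 0.0)]


def q_value(V, s, a):
    """One-step look ahead: sum over s' of P(s'|s,a) (R + gamma V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate (11.13b) until the value function stops changing."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def policy_iteration():
    policy = [0] * N_STATES  # start from "always move left"
    while True:
        V = evaluate(policy)
        # Policy improvement, as in (11.14): act greedily with V held fixed.
        improved = [max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in range(N_STATES)]
        if improved == policy:
            return policy, V
        policy = improved


policy, V = policy_iteration()
print(policy, [round(v, 3) for v in V])
```

In this sketch the policy improves one state at a time, working backwards from the rewarding transition, which mirrors the remark about optimizing backwards from a state known to give a reward.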


Value Iteration 


Value iteration is similar to policy iteration, except that at every iteration only 
the value function is updated, and the optimal policy is extracted from this 
value function at the end. First, the value function is initialized, typically either 
with zeros or at random. Then, for all states $s \in \mathcal{S}$, the value function is updated
by returning the maximum value at that state across all actions $a \in \mathcal{A}$, holding
the value function fixed at all other states $s' \in \mathcal{S} \setminus s$:

$$
V(s) = \max_a \sum_{s'} P(s' \mid s, a) \big(R(s', s, a) + \gamma V(s')\big). \tag{11.15}
$$

This iteration is repeated until a convergence criterion is met.
After the value function converges, it is possible to extract the optimizing
policy $\pi$:

$$
\pi(s, a) = \operatorname*{argmax}_a \sum_{s'} P(s' \mid s, a) \big(R(s', s, a) + \gamma V(s')\big). \tag{11.16}
$$


Although value iteration typically requires fewer steps per iteration, policy 
iteration often converges in fewer iterations. This may be due to the fact that 
the value function is often more complex than the policy function, requiring 
more parameters to optimize over. 
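
For comparison, value iteration can be sketched on a small example. This is a minimal illustration, assuming a hypothetical four-state chain MDP defined by the `transitions` function below (an assumption for this example, not a model from the text). Each sweep applies the update (11.15) to every state, and the policy is extracted once at the end as in (11.16).

```python
GAMMA = 0.9
N_STATES, ACTIONS = 4, (0, 1)  # actions: 0 = left, 1 = right; state 3 is terminal


def transitions(s, a):
    """Return [(probability, next_state, reward)] for a hypothetical chain MDP."""
    if s == N_STATES - 1:
        return [(1.0, s, 0.0)]  # absorbing goal state
    target = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if target == N_STATES - 1 else 0.0
    return [(0.9, target, reward), (0.1, s, 0.0)]


def backup(V, s, a):
    """Sum over s' of P(s'|s,a) (R(s',s,a) + gamma V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))


def value_iteration(tol=1e-10):
    V = [0.0] * N_STATES  # initialize with zeros
    while True:
        # Update (11.15): maximize over actions, using the previous sweep's V.
        V_new = [max(backup(V, s, a) for a in ACTIONS) for s in range(N_STATES)]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    # Policy extraction (11.16), performed only once after convergence.
    policy = [max(ACTIONS, key=lambda a: backup(V, s, a)) for s in range(N_STATES)]
    return V, policy


V, policy = value_iteration()
print([round(v, 3) for v in V], policy)
```

Each sweep here is a single maximization per state, with no inner policy-evaluation loop, which is the sense in which value iteration is cheaper per iteration.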

Note that the value function in RL typically refers to a discounted sum of 
future rewards that should be maximized, while in nonlinear control it refers 
to an integrated cost that should be minimized. The phrase value function is 
particularly intuitive when referring to accumulated rewards in the economic 
sense, as it quantifies the value of being in a given state. However, in the case 
of nonlinear control theory, the value function is more accurately thought of as 
quantifying the numerical value of the cost function evaluated on the optimal 
trajectory. This notation can be confusing and is worth careful consideration 
depending on the context. 


Quality Function 


Both policy iteration and value iteration rely on the quality function Q(s, a), 
which is defined as 


\begin{align}
Q(s, a) &= \mathbb{E}\big(R(s', s, a) + \gamma V(s')\big) \tag{11.17a} \\
&= \sum_{s'} P(s' \mid s, a) \big(R(s', s, a) + \gamma V(s')\big). \tag{11.17b}
\end{align}


In a sense, the optimal policy $\pi(s, a)$ and the optimal value function $V(s)$
contain redundant information, as one can be determined from the other via the
quality function $Q(s, a)$:

\begin{align}
\pi(s, a) &= \operatorname*{argmax}_a Q(s, a), \tag{11.18a} \\
V(s) &= \max_a Q(s, a). \tag{11.18b}
\end{align}

This formulation will be used for model-free Q-learning [238, 746] in
Section 11.3.


11.3 Model-Free Reinforcement Learning and Q-Learning 


Both policy iteration and value iteration above rely on the quality function 
Q(s, a), which describes the joint desirability of a given state-action pair. Policy
iteration (11.14) and value iteration (11.15) are both model-based reinforcement
learning strategies, where it is assumed that the MDP model is known: each it- 
eration requires a one-step look ahead, or model-based prediction of the next 
state s’ given the current state s and action a. Based on this model, it is possible 
to forecast and maximize over all possible actions. 

When a model is not available, there are several reinforcement learning ap- 
proaches to learn effective decision and control policies to interact with the en- 
vironment. Perhaps the most straightforward approach is to first learn a model 
of the environment using some data-driven active learning strategy, and then 
use the standard model-based approaches discussed earlier. However, this may 
be infeasible for very large or particularly unstructured systems. 

Q-learning is a leading model-free alternative, which learns the Q function 
directly from experience, without requiring access to a model. Thus, it is pos- 
sible to generalize many of the model-based optimization strategies above to 
more unstructured settings, where a model is unavailable. The Q function has 
the one-step look ahead implicitly built into its representation, without needing 
to explicitly refer to a model. From this learned Q function, the optimal policy 
and value function may be extracted as in (11.18). 

Before discussing the mechanics of Q-learning in detail, it is helpful to in- 
troduce several concepts, including Monte Carlo-based learning and temporal 
difference learning. 


Monte Carlo Learning 


In the simplest approach to learning from experience, the value function V or 
quality function Q may be learned through a Monte Carlo random sampling 
of the state—action space through repeated evaluation of many policies. Monte 
Carlo approaches require that the RL task is episodic, meaning that the task has 
a defined start and terminates after a finite number of actions, resulting in a 
total cumulative reward at the end of the episode. Games are good examples of 
episodic RL tasks. 

In Monte Carlo learning, the total cumulative reward at the end of the task 
is used to estimate either the value function V or the quality function Q by 
dividing the final reward equally among all of the intermediate states or state- 
action pairs, respectively. This is the simplest possible approach to deal with the 
credit assignment problem, as credit is shared equally among all intermediate 
steps. However, for this reason, Monte Carlo learning is typically quite sample- 
inefficient, especially for problems with sparse rewards. 

Consider the case of Monte Carlo learning of the value function. Given a
new episode consisting of $n$ steps, the cumulative discounted reward $R_\Sigma$ is
computed as

$$
R_\Sigma = \sum_{k=1}^{n} \gamma^k r_k, \tag{11.19}
$$

and used to update the value function at every state $s_k$ visited in this episode:

$$
V^{\text{new}}(s_k) = V^{\text{old}}(s_k) + \frac{1}{n} \big(R_\Sigma - V^{\text{old}}(s_k)\big) \quad \forall k \in [1, \ldots, n]. \tag{11.20}
$$


This update, weighted by 1/n, is equivalent to waiting until the end of the 
episode and then updating the value function at all states along the trajectory 
with an equal share of the reward. Similarly, in the case of Monte Carlo learning 
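of the Q function, the discounted reward $R_\Sigma$ is used to update the Q function
at every state-action pair $(s_k, a_k)$ visited in this episode:

$$
Q^{\text{new}}(s_k, a_k) = Q^{\text{old}}(s_k, a_k) + \frac{1}{n} \big(R_\Sigma - Q^{\text{old}}(s_k, a_k)\big) \quad \forall k \in [1, \ldots, n]. \tag{11.21}
$$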


In the limit of infinite data and infinite exploration, this approach will even- 
tually sample all possible state—action pairs and converge to the true quality 
function Q. However, in practice, this often amounts to an intractable brute- 
force search. 

It is also possible to discount past experiences by introducing a learning rate 
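$\alpha \in [0, 1]$ and using this to update the Q function:

$$
Q^{\text{new}}(s_k, a_k) = Q^{\text{old}}(s_k, a_k) + \alpha \big(R_\Sigma - Q^{\text{old}}(s_k, a_k)\big) \quad \forall k \in [1, \ldots, n]. \tag{11.22}
$$

Larger learning rates $\alpha > 1/n$ will favor more recent experience.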
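
As a concrete illustration of the Monte Carlo update, the following minimal sketch computes the episode return with the indexing of (11.19) and then nudges the value estimate at every visited state towards it. The function names, the dictionary representation of $V$, and the three-state episode are assumptions made for this example; applying a learning rate to $V$ rather than $Q$ is also a choice made here for brevity.

```python
def discounted_return(rewards, gamma):
    """R_sum = sum_{k=1}^{n} gamma^k r_k, following the indexing in (11.19)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards, start=1))


def monte_carlo_update(V, states, rewards, gamma, alpha=None):
    """Move V at every visited state towards the episode return.

    With alpha=None the weight 1/n from (11.20) is used; otherwise alpha plays
    the role of the learning rate in (11.22), applied here to V instead of Q.
    A state visited several times in the episode is updated once per visit.
    """
    n = len(states)
    weight = 1.0 / n if alpha is None else alpha
    R = discounted_return(rewards, gamma)
    for s in states:
        V[s] += weight * (R - V[s])
    return V


# Hypothetical episode: three states visited, with a single reward at the end.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
monte_carlo_update(V, states=["A", "B", "C"], rewards=[0.0, 0.0, 1.0], gamma=0.9)
print(V)
```

Every state in the episode receives the same share of the final reward, regardless of how close it was to the rewarding step, which is the equal credit assignment described above.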

There is a question about how to initialize the many episodes required to 
learn with Monte Carlo. When possible, the episode will be initialized ran- 
domly at every initial state or state—action pair, providing a random sampling; 
however, this might not be possible for many learning tasks. Typically, Monte 
Carlo learning is performed on-policy, meaning that the optimal policy is en- 
acted, based on the current value or quality function, and the information from 
this locally optimal policy is used for the update. It is also possible to pro- 
mote exploration by adding a small probability of taking a random action, 
rather than the action dictated by the optimal policy. Finally, there are off-policy 
Monte Carlo methods, but, in general, they are quite inefficient or unfeasible. 


Temporal Difference (TD) Learning 


Temporal difference learning [103, 700], known as TD learning, is another
sample-based learning strategy. In contrast to Monte Carlo learning,
TD learning is not restricted to episodic tasks, but instead learns continuously
by bootstrapping based on current estimates of the value function $V$ or
quality function $Q$, as in dynamic programming (e.g., as in value iteration in
(11.15)). TD learning is designed to mimic learning processes in animals, where
time-delayed rewards are often learned through environmental cues that act as
secondary reinforcers preceding the delayed reward; this is most popularly
understood through the story of Pavlov’s dog [551]. Thus, TD learning is typically
more sample efficient than Monte Carlo learning, resulting in decreased
variance, but at the cost of a bias in the learning due to the bootstrapping. We will
demonstrate TD learning of the value function, although it can also be used to
learn the quality function.


TD(0): One-Step Look Ahead 


To understand TD learning, it is helpful to begin with the simplest algorithm: 
TD(0). In TD(0), the estimate of the one-step-ahead future reward is used to 
update the current value function. 

Given a control trajectory generated through an optimal policy $\pi$, the value
function at state $s_k$ is given by

$$
V(s_k) = \mathbb{E}\big(r_k + \gamma V(s_{k+1})\big). \tag{11.23}
$$


In the language of Bayesian statistics, $r_k + \gamma V(s_{k+1})$ is an unbiased estimator for
$V(s_k)$.

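Instead of using a model to predict $s_{k+1}$, which is required to evaluate $V(s_{k+1})$,
it is possible to wait until the next step is actually taken and retroactively adjust
the value function:

$$
V^{\text{new}}(s_k) = V^{\text{old}}(s_k) + \alpha \overbrace{\Big( \underbrace{r_k + \gamma V^{\text{old}}(s_{k+1})}_{\text{TD target estimate } R_\Sigma} - V^{\text{old}}(s_k) \Big)}^{\text{TD error}}. \tag{11.24}
$$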


For non-optimal policies $\pi$, this same idea may be used to update the value
function based on the value function one step in the future. Notice that this is 
very similar to optimization of the Bellman equation using dynamic program- 
ming but with retroactive updates based on sampled data rather than proactive 
updates based on a model prediction. 

In the TD(0) update above, the expression $R_\Sigma = r_k + \gamma V(s_{k+1})$ is known
as the TD target, as it is the estimate for the future reward, analogous to $R_\Sigma$ in
Monte Carlo learning of the Q function in (11.22). The difference between this
target and the previous estimate of the value function is the TD error, and it
is used to update the value function, just as in Monte Carlo learning, with a
learning rate $\alpha$.
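
The TD(0) update can be written in a few lines. The sketch below is a minimal illustration, assuming a hypothetical deterministic chain A, B, C, END with a reward of 1 on the final transition; the function name `td0_update` and the dictionary representation of $V$ are choices made for this example.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply the TD(0) update to V[s] after observing the transition (s, r, s_next)."""
    td_target = r + gamma * V[s_next]  # one-step estimate of the future reward
    td_error = td_target - V[s]        # mismatch with the previous estimate
    V[s] += alpha * td_error
    return td_error


# Hypothetical deterministic chain A -> B -> C -> END, with reward 1 on the last step.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "END")]
V = {"A": 0.0, "B": 0.0, "C": 0.0, "END": 0.0}
for _ in range(200):
    for s, r, s_next in episode:
        td0_update(V, s, r, s_next, alpha=0.5, gamma=0.9)

print({k: round(v, 3) for k, v in V.items()})
```

Because each update only looks one step ahead, the reward information propagates backwards along the chain by one state per episode, rather than being shared across all states at the end of the episode as in Monte Carlo learning.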
TD(n): n-Step Look Ahead


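Other temporal difference algorithms can be developed, based on multi-step
look-aheads into the future. For example, TD(1) uses a TD target based on two
steps into the future,

$$
R_\Sigma^{(1)} = r_k + \gamma r_{k+1} + \gamma^2 V(s_{k+2}), \tag{11.25}
$$

and TD(n) uses a TD target based on $n + 1$ steps into the future,

\begin{align}
R_\Sigma^{(n)} &= r_k + \gamma r_{k+1} + \gamma^2 r_{k+2} + \cdots + \gamma^n r_{k+n} + \gamma^{n+1} V(s_{k+n+1}) \tag{11.26a} \\
&= \Big( \sum_{j=0}^{n} \gamma^j r_{k+j} \Big) + \gamma^{n+1} V(s_{k+n+1}). \tag{11.26b}
\end{align}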


Again, there does not need to be a model for these future states, but, instead, 
the value function may be retroactively adjusted based on the actual sampled 
trajectory and rewards. Note that in the limit that an entire episode is used, 
TD(n) converges to the Monte Carlo learning approach. 
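
The multi-step target is straightforward to compute from a sampled trajectory. The following minimal sketch assumes the TD(n) convention used in this section, namely $n + 1$ sampled rewards followed by a bootstrapped value; the function name and the list-based storage of rewards and value estimates are assumptions for illustration.

```python
def n_step_target(rewards, values, k, n, gamma):
    """TD(n) target: n + 1 sampled rewards plus a discounted bootstrapped value.

    rewards[j] is the sampled reward r_j, and values[j] is the current
    estimate V(s_j) along the trajectory.
    """
    reward_sum = sum(gamma ** j * rewards[k + j] for j in range(n + 1))
    return reward_sum + gamma ** (n + 1) * values[k + n + 1]


# Hypothetical sampled rewards and current value estimates along one trajectory.
rewards = [1.0, 2.0, 3.0]
values = [0.0, 4.0, 0.0, 10.0]

print(n_step_target(rewards, values, k=0, n=0, gamma=0.5))  # TD(0) target
print(n_step_target(rewards, values, k=0, n=1, gamma=0.5))  # TD(1) target
print(n_step_target(rewards, values, k=0, n=2, gamma=0.5))  # TD(2) target
```

Increasing $n$ replaces more of the bootstrapped estimate with sampled rewards, so the target leans less on the current value function and more on the observed trajectory.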


TD-λ: Weighted Look Ahead


An important variant of the TD learning family is TD-λ, which was introduced
by Sutton [682]. TD-λ creates a TD target $R_\Sigma^\lambda$ that is a weighted average of the
various TD(n) targets $R_\Sigma^{(n)}$. The weighting is given by

$$
R_\Sigma^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_\Sigma^{(n)}, \tag{11.27}
$$

and the update equation is

$$
V^{\text{new}}(s_k) = V^{\text{old}}(s_k) + \alpha \big(R_\Sigma^\lambda - V^{\text{old}}(s_k)\big). \tag{11.28}
$$

TD-λ was used for an impressive demonstration in the game of backgammon
by Tesauro in 1995 [700].
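
As a quick check on the weighting, and assuming it takes the standard geometric form with weights $(1 - \lambda)\lambda^{n-1}$ for $0 \le \lambda < 1$, the weights sum to one, so the combined target is a proper weighted average of the individual targets:

```latex
(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}
  = (1 - \lambda) \cdot \frac{1}{1 - \lambda}
  = 1, \qquad 0 \le \lambda < 1 .
```

Under this assumption, setting $\lambda = 0$ places all of the weight on the $n = 1$ term, while values of $\lambda$ closer to one spread the weight over progressively longer look-aheads.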


TD learning provides one of the strongest connections between reinforcement 
learning and learning in biological systems. Neural circuits are believed to es- 
timate the future reward, and feedback is based on the difference between the 
expected reward and the actual reward, which is closely related to the TD er- 
ror. In fact, there are specific neurotransmitter feedback loops that strengthen 
connections based on proximity of their firing to a dopamine reward signal 
[282]. The closer the proximity in time between an action and a reward, 
the stronger the feedback, which has implications for addiction.
Bias–Variance Tradeoff


Monte Carlo learning and TD learning exemplify the bias–variance tradeoff in
machine learning. Monte Carlo learning typically has high variance but no bias,
while TD learning has lower variance but introduces a bias because of the
bootstrapping. Although the true TD target $r_k + \gamma V(s_{k+1})$ is an unbiased estimate
of $V(s_k)$ for an optimal policy $\pi$, the sampled TD target is a biased estimate,
because it uses sub-optimal actions and the current imperfect estimate of the
value function.


SARSA: State–Action–Reward–State–Action Learning


SARSA is a popular TD algorithm that is used to learn the Q function on-policy.
The Q update equation in SARSA(0) is nearly identical to the V update equation
(11.24) in TD(0):

$$
Q^{\text{new}}(s_k, a_k) = Q^{\text{old}}(s_k, a_k) + \alpha \big(r_k + \gamma Q^{\text{old}}(s_{k+1}, a_{k+1}) - Q^{\text{old}}(s_k, a_k)\big). \tag{11.29}
$$


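There are SARSA variants for all of the TD(n) algorithms, based on the n-step
TD target:

\begin{align}
R_\Sigma^{(n)} &= r_k + \gamma r_{k+1} + \gamma^2 r_{k+2} + \cdots + \gamma^n r_{k+n} + \gamma^{n+1} Q(s_{k+n+1}, a_{k+n+1}) \tag{11.30a} \\
&= \Big( \sum_{j=0}^{n} \gamma^j r_{k+j} \Big) + \gamma^{n+1} Q(s_{k+n+1}, a_{k+n+1}). \tag{11.30b}
\end{align}

In this case, the SARSA(n) update equation is given by

$$
Q^{\text{new}}(s_k, a_k) = Q^{\text{old}}(s_k, a_k) + \alpha \big(R_\Sigma^{(n)} - Q^{\text{old}}(s_k, a_k)\big). \tag{11.31}
$$

Note that this is on-policy because the actual action sequence $a_k, a_{k+1}, \ldots, a_{k+n+1}$
has been used to receive the rewards $r$ and evaluate the $(n + 1)$-step Q function
$Q(s_{k+n+1}, a_{k+n+1})$.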
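
The one-step SARSA update is shown below as a minimal sketch. The dictionary keyed by (state, action) pairs and the single hand-made transition are assumptions for illustration; the point is that the TD target uses the action `a_next` that the agent actually takes at the next state.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA(0): bootstrap from the action actually taken at the next state."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical Q table with two states and a single action label "R".
Q = {(0, "R"): 0.0, (1, "R"): 2.0}
sarsa_update(Q, s=0, a="R", r=1.0, s_next=1, a_next="R", alpha=0.5, gamma=0.9)
print(Q[(0, "R")])
```

The name of the algorithm follows the arguments of this function: state, action, reward, next state, next action.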


Q-Learning 


We are now ready to discuss Q-learning [238, 746], which is one of the most
widely used approaches in model-free RL. Q-learning is essentially an off-policy
TD learning scheme for the Q function. In Q-learning, the Q update equation is

$$
Q^{\text{new}}(s_k, a_k) = Q^{\text{old}}(s_k, a_k) + \alpha \big(r_k + \gamma \max_a Q(s_{k+1}, a) - Q^{\text{old}}(s_k, a_k)\big). \tag{11.32}
$$


Notice that the only difference between Q-learning and SARSA(0) is that SARSA(0)
uses $Q(s_{k+1}, a_{k+1})$ for the TD target, while Q-learning uses $\max_a Q(s_{k+1}, a)$ for
the TD target. Thus, SARSA(0) is considered on-policy because it uses the action
$a_{k+1}$ based on the actual policy: $a_{k+1} = \pi(s_{k+1})$. In contrast, Q-learning is
off-policy because it uses the optimal $a$ for the update based on the current estimate
for $Q$, while taking a different action $a_{k+1}$ based on a different behavior policy.
Thus, Q-learning may take sub-optimal actions $a_{k+1}$ to explore, while still using
the optimal action $a$ to update the Q function.
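
The off-policy character of Q-learning can be seen in a small experiment. The sketch below is a minimal illustration, assuming a hypothetical deterministic four-state chain (the `step` function, the reward of 1 for reaching the last state, and the hyperparameters are assumptions for this example). The behavior policy is uniformly random, yet the update still bootstraps from the greedy action at the next state.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
N_STATES, ACTIONS = 4, (0, 1)  # actions: 0 = left, 1 = right; state 3 is terminal


def step(s, a):
    """Hypothetical deterministic chain: reward 1 for arriving at the last state."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # tabular Q; terminal row stays zero
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            a = rng.choice(ACTIONS)  # behavior policy: uniformly random actions
            s_next, r = step(s, a)
            # Off-policy TD target: maximize over actions at the next state.
            td_target = r + GAMMA * max(Q[s_next])
            Q[s][a] += ALPHA * (td_target - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
print([[round(q, 3) for q in row] for row in Q])
```

Even though the agent never follows the greedy policy during training in this sketch, the learned table ranks "right" above "left" in every non-terminal state. Replacing `max(Q[s_next])` with the value of the next sampled action would turn the update into SARSA(0), which would instead evaluate the random behavior policy.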

Generally, Q-learning will learn a more optimal solution faster than SARSA, 
but with more variance in the solution. However, SARSA will typically yield 
more cumulative rewards during the training process, since it is on-policy. In 
safety-critical applications, such as self-driving cars or other applications where 
there can be catastrophic failure, SARSA will typically learn less optimal solu- 
tions, but with a better safety margin, since it maximizes on-policy rewards. 

Q-learning applies to discrete action spaces A and state spaces S governed 
by a finite MDP. The Q function is classically represented as a table of Q values 
that is updated through some iteration based on new information as a policy 
is tested and evaluated. However, this tabular approach does not scale well to 
large state spaces, and so typically function approximation is used to represent 
the Q function, such as a neural network in deep Q-learning. Even if the ac- 
tion and state spaces are continuous, as in the pendulum on a cart system, it is 
possible to discretize and then apply Q-learning. 

In addition to being model-free, Q-learning is also referred to as off-policy 
RL, as it does not require that an optimal policy is enacted, as in policy it- 
eration and value iteration. Off-policy learning is more realistic in real-world 
applications, enabling the RL agent to improve when its policy is sub-optimal 
and by watching and imitating other more skilled agents. Q-learning is espe- 
cially good for games, such as backgammon, chess, and go. In particular, deep 
Q-learning, which approximates the Q function using a deep neural network, 
has been used to surpass the world champions in these challenging games. 


Experience Replay and Imitation Learning 


Because Q-learning is off-policy, it is possible to learn from action-state se- 
quences that do not use the current optimal policy. For example, it is possible 
to store past experiences, such as previously played games, and replay these 
experiences to further improve the Q function. 

In an on-policy strategy, such as SARSA, using actions that are sub-optimal, 
based on the current optimal policy, will degrade the Q function, since the TD 
target will be a flawed estimate of future rewards based on a sub-optimal ac- 
tion. However, in Q-learning, since the action is optimized over the current Q 
function in the update, it is possible to learn from experience resulting from 
sub-optimal actions. This also makes it possible to learn from watching other, 
more experienced agents, which is related to imitation learning [625].

Experience replay is deeply intuitive, as it is closely related to how we learn, 
through recalling past experiences in the light of new knowledge (i.e., an up- 
dated Q function). Similarly, imitation learning is perhaps one of the most fun- 
damental first steps in biological learning. 
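
A replay memory can be sketched with a bounded queue of stored transitions. The code below is a minimal illustration, not a reference implementation: the class name `ReplayBuffer`, the tuple layout `(s, a, r, s_next, done)`, and the tabular `Q` stored as a list of lists are all assumptions made for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Draw stored transitions uniformly at random, without replacement."""
        k = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), k)


def replay(Q, buffer, batch_size, alpha, gamma):
    """Re-apply the Q-learning update to stored transitions using the current Q."""
    for s, a, r, s_next, done in buffer.sample(batch_size):
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical usage: two states, two actions, one stored terminal transition.
Q = [[0.0, 0.0], [0.0, 0.0]]
buffer = ReplayBuffer(capacity=100)
buffer.push(0, 1, 1.0, 1, True)
replay(Q, buffer, batch_size=8, alpha=0.5, gamma=0.9)
print(Q)
```

Because the target is recomputed from the current `Q` each time a transition is replayed, old experience is reinterpreted in light of the updated Q function.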


Exploration versus Exploitation: ε-Greedy Actions


It is important to introduce an element of random exploration into Q-learning,
and there are several techniques. One approach is the $\epsilon$-greedy algorithm to
select the next action. In this approach, the agent takes the current optimal
action $a_k = \operatorname{argmax}_a Q(s_k, a)$, based on the current Q function, with probability $1 - \epsilon$,
where $\epsilon \in [0, 1]$. With probability $\epsilon$, the agent takes a random action. Thus, the
agent balances exploration with the random actions, and exploitation with the
optimal actions. Larger $\epsilon$ promotes more random exploration.

Typically, the value of $\epsilon$ will be initialized to a large value, often $\epsilon = 1$.
Throughout the course of training, $\epsilon$ decays so that, as the Q function improves,
the agent increasingly takes the current optimal action. This is closely related 
to simulated annealing from optimization, which mimics the process of forging 
metal to find a low-energy state through a specific cooling schedule. 
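
Action selection with a decaying exploration rate can be sketched as follows. This is a minimal illustration: the exponential decay schedule, the floor `eps_min`, and the function names are assumptions chosen for this example, and other schedules are equally plausible.

```python
import random


def epsilon_greedy(q_values, eps, rng):
    """Return a random action index with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.99):
    """Hypothetical schedule: exponential decay from eps_start down to a floor."""
    return max(eps_min, eps_start * decay ** episode)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q(s, a) for three actions at one state
for episode in (0, 100, 1000):
    eps = decayed_epsilon(episode)
    print(episode, round(eps, 3), epsilon_greedy(q_values, eps, rng))
```

Keeping a small floor on the exploration rate is one common design choice, so that the agent continues to sample non-greedy actions occasionally even late in training.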


Policy Gradient Optimization 


Policy gradient optimization is a powerful technique to optimize
a policy that is parameterized, as in (11.2). When the policy $\pi$ is parameterized
by $\theta$, it is possible to use gradient optimization on the parameters to improve
the policy much faster than through traditional iteration. The parameterization
may be a multi-layer neural network, in which case this would be a deep policy
network, although other representations and function approximations may be
useful. In any case, instead of extracting the policy as the argument maximizing
the value or quality functions, it is possible to directly optimize the parameters
$\theta$, for example through gradient descent or stochastic gradient descent. The
value function $V_\pi(s)$, depending on a policy $\pi$, then becomes $V(s, \theta)$, and a
similar modification is possible for the quality function $Q$.
The total estimated reward is given by 


$$R_{\Sigma,\theta} = \mathbb{E}(Q(s, a)) = \sum_{s \in \mathcal{S}} \mu_\theta(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a) Q(s, a), \tag{11.33}$$


where $\mu_\theta$ is the asymptotic steady state of the MDP given a policy $\pi_\theta$ 
parameterized by $\theta$. It is then possible to compute the gradient of the total 
estimated reward with respect to $\theta$: 


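\begin{align}
\nabla_\theta R_{\Sigma,\theta} &= \sum_{s \in \mathcal{S}} \mu_\theta(s) \sum_{a \in \mathcal{A}} Q(s, a) \nabla_\theta \pi_\theta(s, a) \tag{11.34a} \\
&= \sum_{s \in \mathcal{S}} \mu_\theta(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a) Q(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} \tag{11.34b} \\
&= \sum_{s \in \mathcal{S}} \mu_\theta(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a) Q(s, a) \nabla_\theta \log(\pi_\theta(s, a)) \tag{11.34c} \\
&= \mathbb{E}\left(Q(s, a) \nabla_\theta \log(\pi_\theta(s, a))\right). \tag{11.34d}
\end{align}

Then the policy parameters may be updated as 

$$\theta^{\text{new}} = \theta^{\text{old}} + \alpha \nabla_\theta R_{\Sigma,\theta}, \tag{11.35}$$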


where $\alpha$ is the learning weight; note that $\alpha$ may be replaced with a vector of 
learning weights for each component of $\theta$. There are several approaches to 
approximating this gradient, including through finite differences, the REINFORCE 
algorithm [759], and natural policy gradients [368]. 
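
The log-derivative form of the gradient and the parameter update can be checked on a toy problem. The sketch below assumes a single state (so the steady-state weighting is trivial), a softmax policy over three actions, and fixed, known action qualities `q`; all of these are illustrative assumptions, and the expectation is computed exactly rather than sampled as REINFORCE would do.

```python
import math

def softmax(theta):
    """Softmax policy pi_theta(a) for a single state (assumed parameterization)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, q):
    """Total estimated reward: sum_a pi_theta(a) Q(a)."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))

def score_function_gradient(theta, q):
    """Exact E[Q(a) * grad_theta log pi_theta(a)], the form in (11.34d)."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            dlogpi = (1.0 if a == j else 0.0) - pi[j]   # d log pi(a) / d theta_j
            grad[j] += pi[a] * q[a] * dlogpi
    return grad

def gradient_ascent(theta, q, alpha=0.5, steps=200):
    """Repeated application of the update theta <- theta + alpha * gradient."""
    for _ in range(steps):
        g = score_function_gradient(theta, q)
        theta = [t + alpha * gj for t, gj in zip(theta, g)]
    return theta

q = [1.0, 2.0, 0.5]                      # hypothetical action qualities
theta = gradient_ascent([0.0, 0.0, 0.0], q)
```

After training, the policy concentrates on the action with the largest quality. In practice the expectation is not available in closed form and is estimated from sampled trajectories.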


11.4 Deep Reinforcement Learning 


Deep reinforcement learning is one of the most exciting areas of machine learn- 
ing and of control theory, and it is one of the most promising avenues of re- 
search towards generalized artificial intelligence. Deep learning has revolu- 
tionized our ability to represent complicated functions from data, providing 
a set of architectures for achieving human-level performance in complex tasks 
such as image recognition and natural language processing. Classic reinforcement 
learning suffers from a representation problem, as many of the relevant 
functions, such as the policy $\pi$, the value function V, and the quality function 
Q, may be exceedingly complicated functions defined over a very high-dimensional 
state and action space. Indeed, even for simple games, such as 
the 1972 Atari game Pong, the black-and-white screen at standard resolution 
336 × 240 has over $10^{24\,000}$ possible discrete states, making it infeasible to 
represent any of these functions exactly without approximation. Thus, deep learning 
provides a powerful tool for improving these representations. 

It is possible to use deep learning in several different ways to approximate 
the various functions used in RL, or to model the environment more generally. 
Typically, the central challenge is in identifying and representing key features 
in a high-dimensional state space. For example, the policy $\pi(s, a)$ may now be 
approximated by 


$$\pi(s, a) \approx \pi(s, a, \theta), \tag{11.36}$$


where $\theta$ represents the weights of a neural network. 

Figure 11.4: Deep policy network to encode the probability of moving up in 
the game of Pong. Inspired by Andrej Karpathy’s Blog, “Deep Reinforcement 
Learning: Pong from Pixels” at http://karpathy.github.io/2016/05/ 

Figure 11.5: Convolutional structure of deep Q network used to play Atari 
games. Reproduced with permission from [506]. 

This pairing of deep learning for representations with reinforcement learning 
for decision making and control has resulted in dramatic improvements to 
the capabilities of reinforcement learning. For example, Fig. 11.4 shows a simple 
policy network designed to play Pong, and Fig. 11.5 shows a more general 
deep convolutional neural network architecture used to develop a deep Q network 
to play Atari games at human levels of performance [506]. 

Much of what is discussed in this section is also relevant for other function 
approximation techniques besides deep learning. For example, policy gradi- 
ents may be computed and used for gradient-based optimization using other 
representations of the form of (11.36), and there is a long history before deep 
learning [684]. That said, many of the most exciting and impressive re- 
cent demonstrations of RL leverage the full power of deep learning, and so we 
present these innovations in this context. 


Deep Q-Learning 


Many of the most exciting advances in RL over the past decade have involved 
some variation of deep Q-learning, which uses deep neural networks to represent 
the quality function Q. As with the policy in (11.36), it is possible to 
approximate the Q function through some parameterization $\theta$, 


$$Q(s, a) \approx Q(s, a, \theta), \tag{11.37}$$


where $\theta$ represents the weights of a deep neural network. In this representation, 
the training loss function is directly related to the standard Q-learning update 
in (11.32): 


$$L = \mathbb{E}\left[\left(r_k + \gamma \max_a Q(s_{k+1}, a, \theta) - Q(s_k, a_k, \theta)\right)^2\right]. \tag{11.38}$$


The first part of the loss function, $r_k + \gamma \max_a Q(s_{k+1}, a, \theta)$, is the 
temporal difference target from before, and the second part, $Q(s_k, a_k, \theta)$, 
is the prediction. 
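
To make the two parts of the loss concrete, the following sketch evaluates the temporal difference target and the mean squared loss over a small batch of transitions. It is only an illustration: a lookup table stands in for the neural network, the tuple layout `(s, a, r, s_next, done)` is hypothetical, and dropping the bootstrap term on terminal transitions is a common convention assumed here rather than something stated in the text.

```python
def td_targets(batch, q_fn, gamma=0.99):
    """TD targets r_k + gamma * max_a Q(s_{k+1}, a) for each transition.

    q_fn(s) returns the list of Q values for all actions in state s.
    Assumption: no bootstrapping when the transition ends the episode.
    """
    targets = []
    for (s, a, r, s_next, done) in batch:
        bootstrap = 0.0 if done else max(q_fn(s_next))
        targets.append(r + gamma * bootstrap)
    return targets

def dqn_loss(batch, q_fn, gamma=0.99):
    """Mean of (target - prediction)^2, the batch estimate of the loss."""
    targets = td_targets(batch, q_fn, gamma)
    errors = [y - q_fn(s)[a] for y, (s, a, _, _, _) in zip(targets, batch)]
    return sum(e * e for e in errors) / len(errors)

# A two-state, two-action table standing in for the deep Q network.
q_table = {0: [0.0, 1.0], 1: [2.0, 0.5]}
batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
loss = dqn_loss(batch, lambda s: q_table[s], gamma=0.9)
```

In a deep Q network, `q_fn` would be a forward pass through the network, and the gradient of this loss with respect to the weights would drive the training step.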

Deep reinforcement learning based on a deep Q network (DQN) was introduced 
by Mnih et al. [506] to play Atari games. Specifically, this network 
used a deep convolutional neural network to represent the Q function, where 
the inputs were pixels from the Atari screen and actions were joystick motions, 
as shown in Fig. 11.5. In this original paper, both Q functions in (11.38) were 
represented by the same network weights $\theta$. However, in a double DQN [730], 
different networks are used to represent the target and prediction Q functions, 
which reduces bias due to inaccuracies early in training. In double DQN, it 
may be necessary to fix the target network for multiple training iterations of 
the prediction network before updating to improve stability and convergence 
[258]. 

Experience replay is a critical component of training a DQN, which is possi- 
ble because it is an off-policy RL algorithm. Short segments of past experiences 
are used in batches for stochastic gradient descent during training. Moreover, 
to place more importance on experiences with large model mismatch, it is pos- 
sible to weight past experiences by the magnitude of the TD error. This process 
is known as prioritized experience replay [630]. 
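
A simplified version of this weighting can be sketched as follows. The sketch samples stored transitions with probability proportional to the magnitude of their TD error; the small constant `eps` and the function names are assumptions made for illustration, and the exponents and importance-sampling corrections used in published prioritized replay schemes are deliberately left out.

```python
import random

def sampling_probabilities(td_errors, eps=1e-3):
    """Probabilities proportional to |TD error|.

    eps is a hypothetical small offset so that transitions with zero
    error still have a nonzero chance of being replayed.
    """
    priorities = [abs(d) + eps for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def sample_batch(buffer, td_errors, batch_size, rng=random):
    """Draw a replay batch (with replacement) using the priorities above."""
    probs = sampling_probabilities(td_errors)
    return rng.choices(buffer, weights=probs, k=batch_size)

buffer = ["transition_0", "transition_1", "transition_2"]
td_errors = [0.1, -2.0, 0.5]
probs = sampling_probabilities(td_errors)
```

Here the second transition, which has the largest model mismatch, is replayed most often.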


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




Dueling deep Q networks (DDQNs) are another important deep Q-learning 
architecture that is used to improve training when actions have a 
marginal effect on the quality function. In particular, a DDQN splits the quality 
function into the sum of a value function and an advantage function $A(s, a)$, 
which quantifies the additional benefit of a particular action over the value of 
being in that state: 


$$Q(s, a, \theta) = V(s, \theta_1) + A(s, a, \theta_2). \tag{11.39}$$


Thus, separate value and advantage networks are combined to estimate the Q 
function. 

There are a variety of other useful architectures for deep Q-learning, with 
more introduced regularly. For example, deep recurrent Q networks are promising 
for dynamic problems [316]. Advantage actor-critic networks, discussed in 
the next section, combine the DDQN with deep policy networks. 


Actor-Critic Networks 


Actor-critic methods in reinforcement learning simultaneously learn a policy 
function and a value function, with the goal of taking the best of both value- 
based and policy-based learning. The basic idea is to have an actor, which is 
policy-based, and a critic, which is value-based, and to use the temporal dif- 
ference signal from the critic to update the policy parameters. There are many 
actor-critic methods that pre-date deep learning. For example, a simple 
actor-critic approach would update the policy parameters $\theta$ in (11.35) using 
the temporal difference error $r_k + \gamma V(s_{k+1}) - V(s_k)$: 


$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta\left(\log \pi(s_k, a_k, \theta)\right)\left(r_k + \gamma V(s_{k+1}) - V(s_k)\right). \tag{11.40}$$
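
A single step of this simple actor-critic update can be written out for a small tabular problem. The sketch assumes a softmax policy with one preference per state-action pair and a tabular critic; the critic's own TD(0) update and its step size `beta` are additions made here so that the example is complete, and they are not part of the update above.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next,
                      alpha=0.1, beta=0.1, gamma=0.9):
    """Update actor and critic in place from one transition (s, a, r, s_next).

    theta[s][a] are policy preferences (assumed softmax parameterization);
    V[s] is the critic's value estimate.
    """
    td_error = r + gamma * V[s_next] - V[s]        # signal supplied by the critic
    pi = softmax(theta[s])
    for j in range(len(theta[s])):
        dlogpi = (1.0 if j == a else 0.0) - pi[j]  # gradient of log pi(s, a, theta)
        theta[s][j] += alpha * dlogpi * td_error   # actor step
    V[s] += beta * td_error                        # critic step (assumed TD(0))
    return td_error

theta = [[0.0, 0.0], [0.0, 0.0]]   # two states, two actions
V = [0.0, 0.0]
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1)
```

A positive TD error raises the preference for the action that was taken and lowers the others, while a negative TD error does the opposite.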


It is rather straightforward to incorporate deep learning into an actor-critic 
framework. For example, in the advantage actor-critic (A2C) network, the actor 
is a deep policy network, and the critic is a DDQN. In this case, the update is 
given by 


$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta\left(\left(\log \pi(s_k, a_k, \theta)\right) Q(s_k, a_k, \theta_2)\right). \tag{11.41}$$


Challenges and Additional Techniques 


There are several important innovations that are necessary to make reinforce- 
ment learning tractable for even moderately challenging tasks. Two of the biggest 
challenges in RL are: (1) high-dimensional state and action spaces, and (2) sparse 
and delayed rewards. 

Many games, such as chess and go, have exceedingly large state spaces. For 
example, Claude Shannon estimated the number of possible games of chess, 
known as the Shannon number, at around $10^{120}$ in his famous paper 
“Programming a computer for playing chess”; this paper was a major inspiration 
for modern dynamic programming and reinforcement learning. Representing a 
value or quality function, let alone sampling over these states, is beyond astro- 
nomically difficult. Thus, approximate representations of the value or quality 
functions using approximation theory, such as deep neural networks, are nec- 
essary. 

Sparse and delayed rewards represent the central challenge of reinforce- 
ment learning, leading to the well-known credit assignment problem, which 
we have seen multiple times at this point. The following techniques, includ- 
ing reward shaping and hindsight experience replay, are leading techniques to 
overcome the credit assignment problem. 


Reward Shaping 


Perhaps the most standard approach for systems with sparse rewards is a tech- 
nique called reward shaping. This involves designing customized proxy fea- 
tures that are indicative of a future reward and that may be used as an interme- 
diate reward signal. For example, in the game of chess, the relative point count, 
where each piece is assigned a numeric value (e.g., a queen is worth 10 points, 
rooks are worth 5, knights and bishops are worth 3, and pawns are worth 1 
point), is an example of a shaped reward that gives an intermediate reward 
signal each time a piece is taken. 
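
The chess example can be expressed as a short function. This is a sketch under simplifying assumptions: each side's pieces are represented by a hypothetical list of one-letter codes, the king carries no point value, and the shaped reward for a move is taken to be the change in relative point count.

```python
# Point values quoted in the text above.
PIECE_VALUES = {"Q": 10, "R": 5, "B": 3, "N": 3, "P": 1}

def material(pieces):
    """Total point count for one side; unlisted pieces (e.g. the king) count as 0."""
    return sum(PIECE_VALUES.get(p, 0) for p in pieces)

def shaped_reward(own_before, opp_before, own_after, opp_after):
    """Change in relative point count (own minus opponent) across one move."""
    before = material(own_before) - material(opp_before)
    after = material(own_after) - material(opp_after)
    return after - before

# Capturing an opposing rook yields an intermediate reward of 5.
reward = shaped_reward(["K", "Q"], ["K", "R", "P"], ["K", "Q"], ["K", "P"])
```

The reward is zero on most moves and nonzero only when material changes hands, which is still far denser than a single win or loss signal at the end of the game.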

Reward shaping is quite common and can be very effective. However, these 
rewards require expert human guidance to design, and this requires customized 
effort for each new task. Thus, reward shaping is not a viable strategy for a 
generalized artificial intelligence agent capable of learning multiple games or 
tasks. In addition, reward shaping generally limits the upper end of the agent’s 
performance to that of the human expert. 


Hindsight Experience Replay 


In many tasks, such as robotic manipulation, the goal is to move the robot or 
an object from one location to another. For example, consider a robot arm that 
is required to slide an object on a table from point A to point B. Without a de- 
tailed physical model, or other prior knowledge, it is extremely unlikely that 
a random control policy will result in the object actually reaching the desired 
destination, so the rewards may be very sparse. It is possible to shape a reward 
based on the distance of the object to the goal state, although this is not a gen- 
eral strategy and suffers from the limitations discussed above. 

Hindsight experience replay (HER) is a strategy that enriches the 
reward signal by taking failed trials and pretending that they were successful at 
a different task. This approach makes the reward structure much more dense, 
and has the benefit of enabling the simultaneous learning of a whole family of 
motion tasks. 
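
The relabeling idea can be sketched as follows. The code assumes a goal-conditioned setting in which each stored transition records the goal that was being pursued, uses the final achieved state as the substitute goal, and adopts a sparse reward of 0 on success and -1 otherwise; the tuple layout, the function names, and this reward convention are all assumptions made for illustration.

```python
def relabel_with_hindsight(episode, reached):
    """Rewrite a failed episode as if its final state had been the goal.

    episode: list of (state, action, next_state, goal) tuples.
    reached(achieved, goal): user-supplied success test (hypothetical).
    Returns (state, action, next_state, new_goal, reward) tuples.
    """
    achieved = episode[-1][2]                 # where the trial actually ended up
    relabeled = []
    for (s, a, s_next, _original_goal) in episode:
        reward = 0.0 if reached(s_next, achieved) else -1.0
        relabeled.append((s, a, s_next, achieved, reward))
    return relabeled

# A one-dimensional sliding task: the goal was position 5, but the object stopped at 2.
episode = [(0, +1, 1, 5), (1, +1, 2, 5)]
close_enough = lambda achieved, goal: abs(achieved - goal) < 1e-9
replay = relabel_with_hindsight(episode, close_enough)
```

Both the original and the relabeled transitions can then be stored in the replay buffer, so that every trial contributes at least one successful example.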

HER is quite intuitive in the context of human learning, for example in the 
case of tennis. Initially, it is difficult to aim the ball, and shots often go wild 
when learning. However, this provides valuable information about those mus- 
cle actions that might be useful for future tasks. After lots of practice, it then 
becomes possible to pick from different shots and place the ball more deliber- 
ately. 


Curiosity-Driven Exploration 


Another challenge with RL for large open-world environments is that the agent 
may easily get stuck in a local minimum, where it over-optimizes for a small 
region of state space. One approach to this problem is to augment the reward 
signal with a novelty reward that is large in regions of state space that are not 
well modeled. This is known as curiosity-driven exploration [549], and it in- 
volves an intrinsic curiosity module (ICM), which compares a forward model 
of the evolution of the state, or a latent representation of the state, with the 
actual observed evolution. The discrepancy between the model and the actual 
dynamics is the novelty reward. When this difference is large, the agent be- 
comes curious and explores this region more. There are similarities between 
this approach and TD learning, and, in fact, many of the same variations may 
be implemented for curiosity-driven exploration. The main difference is that, in 
TD learning, the reward discrepancy is used as feedback to improve the value 
or quality function; while, in curiosity-driven exploration, the discrepancy is 
explicitly used as an additional reward signal. This is a clever approach to em- 
bedding this fundamental behavior of intelligent biological learning systems, 
to be curious and explore. 
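
As a rough sketch of how such a novelty reward might be computed, the code below scores a transition by the squared error between a forward model's prediction and the observed next state, and adds a weighted copy of that score to the task reward. The forward model here is a hand-written stand-in, and the names `scale` and `weight` are hypothetical tuning parameters; a real ICM would learn the model, typically in a latent feature space.

```python
def novelty_reward(forward_model, state, action, next_state, scale=1.0):
    """Half the squared prediction error of the forward model, times scale."""
    predicted = forward_model(state, action)
    sq_err = sum((p - x) ** 2 for p, x in zip(predicted, next_state))
    return scale * 0.5 * sq_err

def total_reward(extrinsic, intrinsic, weight=0.1):
    """Task reward augmented with the weighted novelty reward."""
    return extrinsic + weight * intrinsic

# Stand-in forward model: the action shifts the first state coordinate.
model = lambda s, a: [s[0] + a, s[1]]

well_modeled = novelty_reward(model, [0.0, 0.0], 1.0, [1.0, 0.0])   # model is right
surprising = novelty_reward(model, [0.0, 0.0], 1.0, [2.0, 0.0])     # model is wrong
```

Transitions that the model predicts well earn no bonus, so the incentive to revisit a region fades as the forward model improves there.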

There are challenges when using this novelty reward for chaotic and stochas- 
tically driven systems, where there are aspects of the state evolution that are 
fundamentally unpredictable. A naive novelty reward would constantly pro- 
vide positive incentive to explore these regions, since the forward model will 
not improve. Instead, the authors in [549] overcome this challenge by predicating 
novelty on the predictability of an outcome given the action, using latent 
features in an autoencoder, so only aspects of the future state that can be af- 
fected by the agent’s actions are included in the novelty signal. 


11.5 Applications and Environments 


Here we provide a brief overview of some of the modern applications and 
success stories of RL, along with some common environments. 


OpenAI Gym 


The OpenAI Gym is an incredible open-source resource to develop and test 
reinforcement learning algorithms in a wide range of environments. Figure 11.6 
shows a small selection of these systems. Example environments include the 
following. 


• Classic Atari video games: over 100 tasks on Atari 2600 games, including 
asteroids, breakout, space invaders, and many others. 


• Classic control benchmarks: tasks include balancing an inverted pendulum 
on a cart, swing-up of a pendulum, swing-up of a double pendulum, 
and driving up a hill with an under-actuated system. 


• Goal-based robotics [769]: tasks include pushing or fetching a block to 
a goal position with a robot arm, with and without sliding after loss of 
contact, and robotic hand manipulation for reaching a pose or orienting 
various objects. 


• MuJoCo [706]: tasks include multi-legged locomotion, running, hopping, 
swimming, etc., within a fast physics simulator environment. 


This wide range of environments and tasks provides an invaluable resource for 
RL researchers, dramatically lowering the barrier to entry and facilitating the 
benchmarking and comparison of new algorithms. 


Classic Board Games 


As discussed throughout this chapter, RL has developed tremendously over the 
past half-century, from a biologically inspired idea to a major field at the fore- 
front of generalized artificial intelligence. This progress can be largely traced 
through the success of RL on increasingly challenging games, where RL has 
learned to interact with and mimic humans, and eventually to defeat our great- 
est Grandmasters. 

Many of the most fundamental advances in RL were either developed for 
the purpose of playing games, or demonstrated on the most challenging games 
of the time. These simple board games also make the struggles of machine 
learning and artificial intelligence more relatable to humans,² as we can reflect 
on our own experiences learning first how to play tic-tac-toe, then checkers, 
and then eventually “real” games, such as backgammon, chess, and go. The 
progression of RL capabilities roughly follows this progression of complexity, 
with tic-tac-toe being essentially a homework exercise, checkers being the 
earliest real demonstration of RL by Arthur Samuel [616], and more complex games 
such as backgammon [700] and eventually chess and go [658, 661] following. 


² “A strange game. The only winning move is not to play. How about a nice game of chess?” 
(WarGames, 1983.) 


Figure 11.6: The OpenAI Gym provides 
a flexible simulation environment to test learning strategies. Examples include 
classic Atari 2600 video games and simulated rule-based control environments, 
including open-world physics and robotics [769]. Other examples include 
classic control benchmarks. Panels shown: Atari, play to get high score in 
Atari 2600 games (SpaceInvaders-v0, Qbert-v0, Breakout-v0); MuJoCo, physics 
emulator for continuous control tasks (Humanoid-v2, HalfCheetah-v2, Ant-v2); 
Robotics, goal-based robotics tasks (FetchPickAndPlace-v1, FetchPush-v1, 
FetchReach-v1). 


Figure 11.7: Reinforcement learning has demonstrated incredible perfor- 
mance in recent expert tasks, such as AlphaGo defeating world champion 
Lee Sedol in the game of go on March 19, 2016. Reproduced from 
https://www.flickr.com/photos/erikbenson/25717574115. 


Interestingly, about three decades passed between each of these definitive land- 
marks. One of the next major landmarks was a recent generalist RL agent that 
can learn to play multiple games [659], rather than specializing in only one task. 

The success of DeepMind’s AlphaGo and AlphaGo Zero demonstrates the 
remarkable power of modern RL. This system was a major breakthrough in 
RL research, learning to beat the Grandmaster Lee Sedol 4-1 in 2016, depicted 
in Fig. 11.7. However, AlphaGo relied heavily on reward shaping and expert 
guidance, making it a custom solution, rather than a generalized learner. Its 
successor, AlphaGo Zero, relied entirely on self-play, and was able to even- 
tually defeat the original AlphaGo decisively. AlphaGo was based largely on 
CNNs, while AlphaGo Zero used a residual network (ResNet). ResNets are eas- 
ier to train, and AlphaGo Zero was one of the first concrete success stories that 
cemented the ResNet as a competitive architecture. AlphaGo Zero was trained 
in 40 days on four tensor processing units, in contrast to many advanced ML 
algorithms that are trained for months on thousands of GPUs. Both AlphaGo 
and AlphaGo Zero are based on using deep learning to improve a Monte Carlo 
tree search. 


Video Games 


Some of the most impressive recent innovations in RL have involved scaling up 
to larger input spaces, which are well exemplified by the ability of RL to master 
classic Atari video games [506]. In the case of Atari games, the pixel space 
is processed using a CNN architecture, with human-level performance being 
achieved mere years after the birth of modern deep learning for image classi- 
fication [414]. More recently, RL has been demonstrated on more sophisticated 
games, such as StarCraft [735], which is a real-time strategy game; DeepMind’s 
AlphaStar became a Grandmaster in 2019. 

General artificial intelligence is one of the grand challenge problems in mod- 
ern machine learning, whereby a learning agent is able to excel at multiple 
tasks, as in biological systems. What is perhaps most impressive about recent 
RL agents that learn video games is that the learning approach is general, so that 
the same RL framework can be used to learn multiple tasks. There is evidence 
that video games may improve performance in human surgeons [607], 
and it may be that future RL agents will master both robotic manipulation and 
video games in a next stage of generalized AI. 


Physical Systems 


Although much of RL has been developed for board games and video games, it 
is increasingly being used for various advanced modeling and control tasks in 
physical systems. Physical systems, such as lasers and fluids [581], often 
require additional considerations, such as continuous state and action spaces 
[589], and the need for certifiable solutions, such as trust regions [644], for 
safety-critical applications (e.g., transportation, autonomous flight, etc.). 

There has been considerable work applying RL in the field of fluid dynamics 
for fluid flow control [582], for example for bluff-body control and 
controlling Rayleigh-Bénard convection [66]. RL has 
also been applied to the related problem of navigation in a fluid environment 
[85, 180, 304], and more recently for turbulence modeling [530]. 

In addition to studying fluids, there is an extensive literature using RL to 
develop control policies for real and simulated robotic systems that operate 
primarily in a fluid environment, for example to learn how to fly and swim. 
For example, some of the earliest work has involved optimizing the flight of 
uninhabited aerial vehicles, with especially impressive helicopter 
aerobatics [2]. Controlling the motion of fish 
is another major area of development, including individual and collective 
motion [733]. Gliding and perching is another large area of development 
[531, 590, 591]. 


Robotics and Autonomy 


Robotics [300, 394] and autonomy [545, 652] are two of the largest areas 
of current research in RL. These both count as physical systems, as in the 
section above, but deserve their own treatment, as these are major areas of 
innovation. In fact, both robotics and autonomy may be viewed as two of the 
most pressing societal applications of machine learning in general, and 
reinforcement learning in particular, with self-driving cars alone promising to 
reshape the modern transportation and energy landscape. As with the discussion 
of physical systems above, these are typically safety-critical applications with 
physical constraints [431, 696]. Figure 11.8 shows a virtual locomotion task that 
involves learning physics in a robot walker. 


Figure 11.8: Illustration of improved bipedal locomotion performance with 
more generations of learning. Reproduced with permission from Geijtenbeek 
et al. [271]. 


11.6 Optimal Nonlinear Control 


Reinforcement learning has considerable overlap with optimal nonlinear con- 
trol, and historically they were developed in parallel under the same optimiza- 
tion framework. Here we provide a brief overview of optimal nonlinear control 
theory, which will provide a connection between the classic linear control theory 
from Chapter 8 and dynamic programming to solve Bellman’s equations 
used in this chapter. We have already seen optimal control in the context of linear 
dynamics and quadratic cost functions in Section 8.4, resulting in the 
linear-quadratic regulator (LQR). Similarly, we have used Bellman’s equations to find 
optimal policies in RL for systems governed by MDPs. A major goal of this 
section is to provide a more general mathematical treatment of Bellman’s equa- 
tions, extending these approaches to fully nonlinear optimal control problems. 
However, this section is very technical and departs from the MDP notation 
used throughout the rest of the chapter; it may be omitted on a first reading. 
For more details, see the excellent text by Stengel [675]. 


Hamilton-Jacobi-Bellman Equation 


In optimal control, the goal is often to find a control input u(t) to drive a dy- 
namical system, 


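$$\frac{d}{dt}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t), \mathbf{u}(t), t), \tag{11.42}$$


to follow a trajectory $\mathbf{x}(t)$ that minimizes a cost function, 


$$J(\mathbf{x}(t), \mathbf{u}(t), t_0, t_f) = Q(\mathbf{x}(t_f), t_f) + \int_{t_0}^{t_f} \mathcal{L}(\mathbf{x}(\tau), \mathbf{u}(\tau))\, d\tau. \tag{11.43}$$

Note that this formulation in (11.43) generalizes the LQR cost function from 
Section 8.4; now the immediate cost function $\mathcal{L}(\mathbf{x}, \mathbf{u})$ and the terminal 
cost $Q(\mathbf{x}(t_f), t_f)$ may be non-quadratic functions. Often there are also 
constraints on the state $\mathbf{x}$ and control $\mathbf{u}$, which determine what 
solutions are admissible. 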

Given an initial state $\mathbf{x}_0 = \mathbf{x}(t_0)$ at $t_0$, an optimal control $\mathbf{u}(t)$ will 
result in an optimal cost function J. We may define a value function 
$V(\mathbf{x}, t_0, t_f)$ that describes the total integrated cost starting at this 
position $\mathbf{x}$ assuming the control law is optimal: 


$$V(\mathbf{x}(t_0), t_0, t_f) = \min_{\mathbf{u}(t)} J(\mathbf{x}(t), \mathbf{u}(t), t_0, t_f), \tag{11.44}$$


where $\mathbf{x}(t)$ is the solution to (11.42) for the optimal $\mathbf{u}(t)$. Notice that the value 
function is no longer a function of the control $\mathbf{u}(t)$, as this has been optimized 
over, and it is also not a function of a trajectory $\mathbf{x}(t)$, but rather of an initial 
state $\mathbf{x}_0$, as the remainder of the trajectory is entirely specified by the dynamics 
and the optimal control law. The value function is often called the cost-to-go in 
control theory, as the value function evaluated at any point $\mathbf{x}(t)$ on an optimal 
trajectory will represent the remaining cost associated with continuing to enact 
this optimal policy until the final time $t_f$. In fact, this is a statement of Bellman’s 
optimality principle, that the value function V remains optimal starting with 
any point on an optimal trajectory. 

The Hamilton-Jacobi-Bellman³ (HJB) equation establishes a partial differential 
equation that must be satisfied by the value function $V(\mathbf{x}(t), t, t_f)$ at every 
intermediate time $t \in [t_0, t_f]$: 


$$-\frac{\partial V}{\partial t} = \min_{\mathbf{u}(t)} \left( \left(\frac{\partial V}{\partial \mathbf{x}}\right)^T \mathbf{f}(\mathbf{x}(t), \mathbf{u}(t)) + \mathcal{L}(\mathbf{x}(t), \mathbf{u}(t)) \right). \tag{11.45}$$


³ Kalman recognized that the Bellman optimal control formulation was a generalization of 
the Hamilton-Jacobi equation from classical mechanics to handle stochastic input-output 
systems. These formulations all involve the calculus of variations, which traces its roots back to 
the brachistochrone problem of Johann Bernoulli. 


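To derive the HJB equation, we may compute the total time derivative of 
the value function $V(\mathbf{x}(t), t, t_f)$ at some intermediate time t: 


\begin{align}
\frac{d}{dt} V(\mathbf{x}(t), t, t_f) &= \frac{\partial V}{\partial t} + \left(\frac{\partial V}{\partial \mathbf{x}}\right)^T \frac{d\mathbf{x}}{dt} \tag{11.46a} \\
&= \min_{\mathbf{u}(t)} \frac{d}{dt} \left( \int_t^{t_f} \mathcal{L}(\mathbf{x}(\tau), \mathbf{u}(\tau))\, d\tau + Q(\mathbf{x}(t_f), t_f) \right) \tag{11.46b} \\
&= \min_{\mathbf{u}(t)} \underbrace{\frac{d}{dt} \int_t^{t_f} \mathcal{L}(\mathbf{x}(\tau), \mathbf{u}(\tau))\, d\tau}_{-\mathcal{L}(\mathbf{x}(t), \mathbf{u}(t))} \tag{11.46c} \\
\Longrightarrow \quad -\frac{\partial V}{\partial t} &= \min_{\mathbf{u}(t)} \left( \left(\frac{\partial V}{\partial \mathbf{x}}\right)^T \mathbf{f}(\mathbf{x}, \mathbf{u}) + \mathcal{L}(\mathbf{x}, \mathbf{u}) \right). \tag{11.46d}
\end{align}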

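Note that the terminal cost does not vary with t, so it has zero time derivative. 
The derivative of the integral of the instantaneous cost 
$\int_t^{t_f} \mathcal{L}(\mathbf{x}(\tau), \mathbf{u}(\tau))\, d\tau$ is equal to 
$-\mathcal{L}(\mathbf{x}(t), \mathbf{u}(t))$ by the first fundamental theorem of calculus. Finally, the 
term $(\partial V/\partial \mathbf{x})^T \mathbf{f}(\mathbf{x}, \mathbf{u})$ may be brought into the minimization argument, since V 
is already defined as the optimal cost over $\mathbf{u}$. The LQR optimal Riccati equation 
is a special case of the HJB equation, and the vector of partial derivatives in 
$(\partial V/\partial \mathbf{x})$ serves the same role as the Lagrange multiplier co-state $\boldsymbol{\lambda}$. The HJB 
equation may also be more intuitive in vector calculus notation: 


$$-\frac{\partial V}{\partial t} = \min_{\mathbf{u}(t)} \left( \nabla V \cdot \mathbf{f}(\mathbf{x}, \mathbf{u}) + \mathcal{L}(\mathbf{x}, \mathbf{u}) \right). \tag{11.47}$$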
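
As a worked check of the remark that the LQR Riccati equation is a special case of the HJB equation, the following sketch substitutes a quadratic value function into the HJB equation. The linear dynamics, the weight matrices $\mathbf{Q}_x$ and $\mathbf{R}_u$ (named this way to avoid a clash with the terminal cost Q), and the ansatz with a symmetric $\mathbf{P}(t)$ are assumptions introduced here for illustration.

```latex
% Assumed setup: linear dynamics, quadratic immediate cost, quadratic value function
\begin{align*}
\dot{\mathbf{x}} &= \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}, \qquad
\mathcal{L}(\mathbf{x},\mathbf{u}) = \mathbf{x}^T\mathbf{Q}_x\mathbf{x} + \mathbf{u}^T\mathbf{R}_u\mathbf{u}, \qquad
V(\mathbf{x},t) = \mathbf{x}^T\mathbf{P}(t)\mathbf{x}. \\
% HJB equation with dV/dt = x^T Pdot x and dV/dx = 2 P x
-\mathbf{x}^T\dot{\mathbf{P}}\mathbf{x} &= \min_{\mathbf{u}}\left( 2\mathbf{x}^T\mathbf{P}(\mathbf{A}\mathbf{x}+\mathbf{B}\mathbf{u})
 + \mathbf{x}^T\mathbf{Q}_x\mathbf{x} + \mathbf{u}^T\mathbf{R}_u\mathbf{u} \right). \\
% Stationarity of the bracket with respect to u
\mathbf{0} &= 2\mathbf{B}^T\mathbf{P}\mathbf{x} + 2\mathbf{R}_u\mathbf{u}
 \quad\Longrightarrow\quad \mathbf{u} = -\mathbf{R}_u^{-1}\mathbf{B}^T\mathbf{P}\mathbf{x}. \\
% Substituting the minimizer and matching the quadratic forms in x
-\dot{\mathbf{P}} &= \mathbf{A}^T\mathbf{P} + \mathbf{P}\mathbf{A}
 - \mathbf{P}\mathbf{B}\mathbf{R}_u^{-1}\mathbf{B}^T\mathbf{P} + \mathbf{Q}_x.
\end{align*}
```

The last line is the differential Riccati equation, and the minimizing control is linear state feedback, provided $\mathbf{R}_u$ is positive definite so that the stationary point is a minimum.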


The HJB formulation above relies implicitly on Bellman’s principle of optimality, 
namely that for any point on an optimal trajectory $\mathbf{x}(t)$, the value function 
V is still optimal for the remainder of the trajectory: 


$$V(\mathbf{x}(t), t, t_f) = \min_{\mathbf{u}} \left( \int_t^{t_f} \mathcal{L}(\mathbf{x}(\tau), \mathbf{u}(\tau))\, d\tau + Q(\mathbf{x}(t_f), t_f) \right). \tag{11.48}$$


One outcome is that the value function can be decomposed as 


$$V(\mathbf{x}(t_0), t_0, t_f) = V(\mathbf{x}(t_0), t_0, t) + V(\mathbf{x}(t), t, t_f). \tag{11.49}$$

This makes it possible to take the total time derivative above. A more rigorous 
derivation is possible using the calculus of variations. 

The HJB equation is incredibly powerful, providing a PDE for the optimal 
solution of general nonlinear control problems. Typically, the HJB equation 
is solved numerically as a two-point boundary value problem, with boundary 
conditions $\mathbf{x}(0) = \mathbf{x}_0$ and $V(\mathbf{x}(t_f), t_f, t_f) = Q(\mathbf{x}(t_f), t_f)$, for example 
using a shooting method. However, a nonlinear control problem with a 
three-dimensional state vector $\mathbf{x} \in \mathbb{R}^3$ will result in a three-dimensional PDE. Thus, 
optimal nonlinear control based on the HJB equation typically suffers from the 
curse of dimensionality. Phase-space clustering techniques have shown great 
promise in reducing the effective state-space dimension for systems that evolve 
on a low-dimensional attractor [367]. 


Discrete-Time HJB and the Bellman Equation 


Bellman’s optimal control is especially intuitive for discrete-time systems, where, 
instead of optimizing over a function, we optimize over a discrete control se- 
quence. Consider a discrete-time dynamical system 


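$$\mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k, \mathbf{u}_k). \tag{11.50}$$

The cost is now given by 

$$J(\mathbf{x}_0, \{\mathbf{u}_k\}_{k=0}^{n}, 0, n) = \sum_{k=0}^{n} \mathcal{L}(\mathbf{x}_k, \mathbf{u}_k) + Q(\mathbf{x}_n, t_n). \tag{11.51}$$


Similarly, the value function is defined as the value of the cumulative cost 
function, starting at a point $\mathbf{x}_0$ assuming an optimal control policy $\mathbf{u}$: 


$$V(\mathbf{x}_0, 0, n) = \min_{\{\mathbf{u}_k\}_{k=0}^{n}} J(\mathbf{x}_0, \{\mathbf{u}_k\}_{k=0}^{n}, 0, n). \tag{11.52}$$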


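Again, Bellman’s principle of optimality states that an optimal control policy 
has the property that, at any point along the optimal trajectory $\mathbf{x}(t)$, the 
remaining control policy is optimal with respect to this new initial state. 
Mathematically, 


$$V(\mathbf{x}_0, 0, n) = V(\mathbf{x}_0, 0, k) + V(\mathbf{x}_k, k, n), \quad \forall k \in (0, n). \tag{11.53}$$


Thus, the value at an intermediate time-step k may be written as 


\begin{align}
V(\mathbf{x}_k, k, n) &= \min_{\mathbf{u}_k} \left( \mathcal{L}(\mathbf{x}_k, \mathbf{u}_k) + V(\mathbf{x}_{k+1}, k+1, n) \right) \quad \text{s.t. } \mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k, \mathbf{u}_k) \tag{11.54a} \\
&= \min_{\mathbf{u}_k} \left( \mathcal{L}(\mathbf{x}_k, \mathbf{u}_k) + V(\mathbf{F}(\mathbf{x}_k, \mathbf{u}_k), k+1, n) \right). \tag{11.54b}
\end{align}

It is also possible, given a value function $V(\mathbf{x}_k, k, n)$, to determine the next 
optimal control action $\mathbf{u}_k$ by returning the $\mathbf{u}_k$ that minimizes the above expression. 
This defines an optimal policy $\mathbf{u} = \pi(\mathbf{x})$. Dropping the functional dependence of 
V on the end time, we then have 


\begin{align}
V(\mathbf{x}) &= \min_{\mathbf{u}} \left( \mathcal{L}(\mathbf{x}, \mathbf{u}) + V(\mathbf{F}(\mathbf{x}, \mathbf{u})) \right), \tag{11.55a} \\
\pi(\mathbf{x}) &= \operatorname*{argmin}_{\mathbf{u}} \left( \mathcal{L}(\mathbf{x}, \mathbf{u}) + V(\mathbf{F}(\mathbf{x}, \mathbf{u})) \right). \tag{11.55b}
\end{align}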


These form the Bellman equations. 
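
For a small problem with finitely many states and controls, the Bellman equations can be solved by repeatedly applying the minimization until the value function stops changing. The sketch below does this for a hypothetical one-dimensional grid in which the control moves the state left or right and each step away from a goal cell costs one unit; the grid, the cost, and the function names are assumptions chosen so that the answer is easy to verify by hand.

```python
def bellman_value_iteration(states, controls, F, L, max_sweeps=1000):
    """Iterate V(x) <- min_u [ L(x, u) + V(F(x, u)) ] to a fixed point.

    F(x, u) is the discrete-time dynamics and L(x, u) the immediate cost.
    Returns the value function and the policy pi(x) = argmin_u [...].
    """
    V = {x: 0.0 for x in states}
    for _ in range(max_sweeps):
        V_new = {x: min(L(x, u) + V[F(x, u)] for u in controls) for x in states}
        if V_new == V:            # fixed point reached
            break
        V = V_new
    policy = {x: min(controls, key=lambda u: L(x, u) + V[F(x, u)])
              for x in states}
    return V, policy

# Hypothetical example: 7 cells, goal at cell 4, moves of -1, 0, or +1.
N, GOAL = 7, 4
states = list(range(N))
controls = [-1, 0, 1]
F = lambda x, u: min(max(x + u, 0), N - 1)    # dynamics with walls at both ends
L = lambda x, u: 0.0 if x == GOAL else 1.0    # unit cost per step away from goal

V, policy = bellman_value_iteration(states, controls, F, L)
```

In this example the value function equals the distance to the goal, and the policy steps toward the goal and then stays there. For continuous or high-dimensional states the table must be replaced by a function approximation, which is where the methods of the earlier sections enter.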

Note that we have explicitly included the terminal time $t_f$ in the terminal 
cost $Q(\mathbf{x}_n, t_n)$ and $Q(\mathbf{x}(t_f), t_f)$, as there are situations when the arrival time 
should be minimized. However, it is also possible to include the time explicitly 
in the immediate cost $\mathcal{L}(\mathbf{x}, \mathbf{u}, t)$, for example to include a discount function 
$e^{-\gamma t}$ for future costs or rewards. 


Homework 


Exercise 11-1. This example will explore reinforcement learning on the game of 
tic-tac-toe. First, describe the states, actions, and rewards. 


Next, design a policy iteration algorithm to optimize the policy $\pi$. Begin with a 
randomly chosen policy. Plot the value function on the board and describe the 
optimal policy. 

How many policy iterations are required before the policy and value function 
converge? How many games were played at each policy iteration? Is this 
consistent with what you would expect a human learner to do? 


Is there any structure or symmetry in the game that could be used to improve 
the learning rate? Implement a policy iteration that exploits this structure, and 
determine how many policy iterations are required before converging and how 
many games played per policy iteration. 


Exercise 11-2. Repeat the above example using value iteration instead of policy 
iteration. Compare the number of iterations in both methods, along with the 
total training time. 


Exercise 11-3. This exercise will develop a reinforcement learning controller 
for the fluid flow past a cylinder. There are several open-source codes that can 
be used to simulate simple fluid flows, such as the IBPM code at 
github.com/cwrowley/ibpm/ 


Use reinforcement learning to develop a control law to force the cylinder wake 
to be symmetric. Describe the reward structure and what learning framework 
you chose. Also plot your results, including learning rates, performance, etc. 
How long did it take to train this controller (i.e., how many computational iter- 
ations, how much CPU time, etc.)? 

Now, assume that the RL agent only has access to the lift and drag coefficients,
$C_L$ and $C_D$. Design an RL scheme to track a given reference lift value, say
$C_L = 1$ or $C_L = -1$. See if you can make your controller track a reference
that switches between these values. What if the reference lift is much larger,
say $C_L = 2$ or $C_L = 5$?


Exercise 11-4. Install the AI Gym API and develop an RL controller for the clas- 
sic control example of a pendulum on a cart. Explore different RL strategies. 
Chapter 12
Reduced-Order Models (ROMs) 


The proper orthogonal decomposition (POD) is the SVD algorithm applied 
to partial differential equations (PDEs). As such, it is one of the most impor- 
tant dimensionality reduction techniques available to study complex, spatio- 
temporal systems. Such systems are typically exemplified by nonlinear PDEs 
that prescribe the evolution in time and space of the quantities of interest in a 
given physical, engineering, and/or biological system. The success of the POD 
is related to the seemingly ubiquitous observation that, in most complex sys- 
tems, meaningful behaviors are encoded in low-dimensional patterns of dy- 
namic activity. The POD technique seeks to take advantage of this fact in or- 
der to produce low-rank dynamical systems capable of accurately modeling 
the full spatio-temporal evolution of the governing complex system. Specifi- 
cally, reduced-order models (ROMs) leverage POD modes for projecting PDE dy- 
namics to low-rank subspaces where simulations of the governing PDE model 
can be more readily evaluated. Importantly, the low-rank models produced by 
the ROM allow for significant improvements in computational speed, potentially
enabling prohibitively expensive Monte Carlo simulations of PDE systems,
optimization over parameterized PDE systems, and/or real-time control of
PDE-based systems. POD has been extensively used in the fluid dynamics
community [335]. It has also found a wide variety of applications in
structural mechanics and vibrational analysis [437], optical and
micro-electromechanical systems (MEMS) technologies [444, 657], atmospheric
sciences (where it is called empirical orthogonal functions (EOFs)) [159],
wind engineering applications [667], acoustics [243], and neuroscience
[703]. The success of the method relies on its ability to provide physically
interpretable spatio-temporal decompositions of data [79, 170, 243, 381, 420, 444].
12.1 Proper Orthogonal Decomposition (POD) for
Partial Differential Equations 


Throughout the engineering, physical, and biological sciences, many systems 
are known to have prescribed relationships between time and space that drive 
patterns of dynamical activity. Even simple spatio-temporal relationships can 
lead to highly complex, yet coherent, dynamics that motivate the main thrust 
of analytic and computational studies. Modeling efforts seek to derive these 
spatio-temporal relationships either through first-principles laws or through 
well-reasoned conjectures about existing relationships, thus leading generally 
to an underlying partial differential equation (PDE) that constrains and governs 
the complex system. Typically, such PDEs are beyond our ability to solve ana- 
lytically. As a result, two primary solution strategies are pursued: computation 
and/or asymptotic reduction. In the former, the complex system is discretized 
in space and time to artificially produce an extremely high-dimensional system 
of equations which can be solved to a desired level of accuracy, with higher 
accuracy requiring a larger dimension of the discretized system. In this tech- 
nique, the high-dimensionality is artificial and simply a consequence of the un- 
derlying numerical solution scheme. In contrast, asymptotic reduction seeks to 
replace the complex system with a simpler set of equations, preferably that are 
linear so as to be amenable to analysis. Before the 1960s and the rise of
computation, such asymptotic reductions formed the backbone of applied
mathematics in fields such as fluid dynamics. Indeed, asymptotics form the basis
of the earliest efforts of dimensionality reduction. Asymptotic methods are not covered
in this book, but the computational methods that enable reduced-order models 
are. 

To be more mathematically precise about our study of complex systems, we 
consider generically a system of nonlinear PDEs of a single spatial variable that 
can be modeled as 


$$u_t = N(u, u_x, u_{xx}, \ldots, x, t; \beta), \tag{12.1}$$


where the subscripts denote partial differentiation and $N(\cdot)$ prescribes the
generically nonlinear evolution. The parameter $\beta$ will represent a
bifurcation parameter for our later considerations. Further, associated with
(12.1) are a set of initial and boundary conditions on a domain
$x \in [-L, L]$. Historically, a number of analytic solution techniques have
been devised to study (12.1). Typically the aim of such methods is to reduce
the PDE (12.1) to a set of ordinary differential equations (ODEs). The standard
PDE methods of separation of variables and similarity solutions are constructed
for this express purpose. Once in the form of an ODE, a broader variety of
analytic methods can be applied along with a qualitative theory in the case of
nonlinear behavior [334]. This again highlights the role that asymptotics can
play in characterizing behavior.
Although a number of potential solution strategies have been mentioned, (12.1)
does not admit a closed-form solution in general. Even the simplest
nonlinearity or a spatially dependent coefficient can render the standard
analytic solution strategies useless. However, computational strategies for
solving (12.1) are abundant and have provided transformative insights across
the physical, engineering, and biological sciences. The various computational
techniques devised lead to an approximate numerical solution of (12.1), which
is of high dimension. Consider, for instance, a standard spatial discretization
of (12.1) whereby the spatial variable $x$ is evaluated at $n \gg 1$ points,


$$u(x_k, t) \quad \text{for} \quad k = 1, 2, \ldots, n, \tag{12.2}$$


with spacing $\Delta x = x_{k+1} - x_k = 2L/n$. Using standard finite-difference
formulas, spatial derivatives can be evaluated using neighboring spatial points
so that, for instance,

$$u_x = \frac{u(x_{k+1}, t) - u(x_{k-1}, t)}{2\Delta x}, \tag{12.3a}$$
$$u_{xx} = \frac{u(x_{k+1}, t) - 2u(x_k, t) + u(x_{k-1}, t)}{\Delta x^2}. \tag{12.3b}$$

Such spatial discretization transforms the governing PDE (12.1) into a set of
$n$ ODEs:

$$\frac{du_k}{dt} = N\left(u(x_{k+1}, t), u(x_k, t), u(x_{k-1}, t), \ldots, x_k, t; \beta\right), \quad k = 1, 2, \ldots, n. \tag{12.4}$$

This process of discretization produces a more manageable system of equations
at the expense of rendering (12.1) high-dimensional. It should be noted that, as
accuracy requirements become more stringent, the resulting dimension $n$ of
the system (12.4) also increases, since $\Delta x = 2L/n$. Thus, the dimension
of the underlying computational scheme is artificially determined by the
accuracy of the finite-difference differentiation schemes.

The spatial discretization of (12.1) illustrates how high-dimensional systems
are rendered. The artificial production of high-dimensional systems is
ubiquitous across computational schemes and presents significant challenges for
scientific computing efforts. To further illustrate this phenomenon, we consider
a second computational scheme for solving (12.1). In particular, we consider the
most common technique for analytically solving PDEs: separation of variables.
In this method, a solution is assumed, whereby space and time are independent,
so that

$$u(x, t) = a(t)\psi(x), \tag{12.5}$$

where the variable $a(t)$ subsumes all the time dependence of (12.1) and
$\psi(x)$ characterizes the spatial dependence. Separation of variables is only
guaranteed to work analytically if (12.1) is linear with constant coefficients.
In that restrictive case, two differential equations can be derived that
separately characterize the spatial and temporal dependences of the complex
system. The differential equations are related by a constant parameter that is
present in each.
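
Returning briefly to the finite-difference discretization, the centered formulas (12.3a) and (12.3b) can be sketched in a few lines of Python. The example below is an illustration only: it assumes periodic boundary conditions (so that neighbors wrap around the grid) and uses the test function $u = \sin(\pi x / L)$, neither of which is prescribed by the text. It shows that halving $\Delta x$ (doubling $n$) reduces the error by roughly a factor of four, which is why higher accuracy forces a larger system.

```python
import numpy as np

def fd_derivatives(u, dx):
    """Centered differences (12.3a) and (12.3b) on a periodic grid."""
    u_right = np.roll(u, -1)            # u(x_{k+1}, t)
    u_left = np.roll(u, 1)              # u(x_{k-1}, t)
    u_x = (u_right - u_left) / (2 * dx)
    u_xx = (u_right - 2 * u + u_left) / dx**2
    return u_x, u_xx

def max_errors(n, L=10.0):
    """Maximum error of u_x and u_xx for u = sin(pi x / L) on [-L, L)."""
    dx = 2 * L / n                      # spacing 2L/n, as in the text
    x = -L + dx * np.arange(n)
    kappa = np.pi / L
    u = np.sin(kappa * x)
    u_x, u_xx = fd_derivatives(u, dx)
    err_x = np.max(np.abs(u_x - kappa * np.cos(kappa * x)))
    err_xx = np.max(np.abs(u_xx + kappa**2 * np.sin(kappa * x)))
    return err_x, err_xx

for n in (64, 128, 256):
    print(n, max_errors(n))
```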

For the general form of (12.1), separation of variables can be used to yield
a computational algorithm capable of producing accurate solutions. Since the
spatial solutions are not known a priori, it is typical to assume a set of basis
modes which are used to construct $\psi(x)$. Indeed, such assumptions on basis
modes underlie the critical ideas of the method of eigenfunction expansions. This
yields a separation-of-variables solution ansatz of the form

$$u(x, t) = \sum_{k=1}^{n} a_k(t)\psi_k(x), \tag{12.6}$$

where $\psi_k(x)$ form a set of $n \gg 1$ basis modes. As before, this expansion
artificially renders a high-dimensional system of equations since $n$ modes are
required. This separation-of-variables solution approximates the true solution,
provided $n$ is large enough. Increasing the number of modes $n$ is equivalent to
increasing the spatial discretization in a finite-difference scheme.

The orthogonality properties of the basis functions $\psi_k(x)$ enable us to make
use of (12.6). To illustrate this, consider a scalar version of (12.1) with the
associated scalar separable solution $u(x, t) = \sum_{k=1}^{n} a_k(t)\psi_k(x)$.
Inserting this solution into the governing equations gives

$$\sum \psi_k \frac{da_k}{dt} = N\left(\sum a_k \psi_k, \sum a_k (\psi_k)_x, \sum a_k (\psi_k)_{xx}, \ldots, x, t; \beta\right), \tag{12.7}$$


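where the sums are from $k = 1, 2, \ldots, n$. Orthogonality of our basis functions
implies that

$$\langle \psi_k, \psi_j \rangle = \delta_{kj} = \begin{cases} 0, & j \neq k, \\ 1, & j = k, \end{cases} \tag{12.8}$$

where $\delta_{kj}$ is the Kronecker delta function and $\langle \psi_k, \psi_j \rangle$ is the
inner product defined as

$$\langle \psi_k, \psi_j \rangle = \int_{-L}^{L} \psi_k \psi_j^* \, dx, \tag{12.9}$$

where $^*$ denotes complex conjugation.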
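
The orthogonality relation (12.8) and the inner product (12.9) are easy to check numerically. The sketch below is illustrative only: it assumes the normalized complex exponentials $\psi_k(x) = \exp(i\pi k x / L)/\sqrt{2L}$ on $[-L, L]$ as the basis (a choice made here so that the modes are orthonormal under (12.9)), approximates the integral by a sum on a uniform periodic grid, and then recovers the coefficients of a known expansion by projection.

```python
import numpy as np

L, n = 5.0, 64
dx = 2 * L / n
x = -L + dx * np.arange(n)              # uniform periodic grid on [-L, L)

ks = np.arange(-3, 4)                   # mode indices k = -3, ..., 3
# Rows of Psi are the (assumed) orthonormal modes psi_k(x).
Psi = np.array([np.exp(1j * np.pi * k * x / L) / np.sqrt(2 * L) for k in ks])

def inner(f, g):
    """Discrete version of (12.9): <f, g> = integral of f * conj(g) over [-L, L]."""
    return dx * np.sum(f * np.conj(g))

# Gram matrix of inner products; (12.8) says this should be the identity.
gram = np.array([[inner(pk, pj) for pj in Psi] for pk in Psi])

# Build u = sum_k a_k psi_k with known coefficients, then recover them.
a_true = np.array([0, 0.5, 0, 1.0, 0, -2.0j, 0])
u = a_true @ Psi
a_rec = np.array([inner(u, pk) for pk in Psi])

print(np.max(np.abs(gram - np.eye(len(ks)))))   # deviation from identity
print(np.max(np.abs(a_rec - a_true)))           # coefficient recovery error
```

Projection onto each mode isolates a single coefficient precisely because all cross terms vanish; this is the step that turns (12.7) into one evolution equation per mode.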

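Once the modal basis is decided on, the governing equations for the $a_k(t)$
can be determined by multiplying (12.7) by each basis mode (conjugated, as in
(12.9)) and integrating over $x \in [-L, L]$. Orthogonality then results in the
temporal governing equations, or Galerkin projected dynamics, for each mode

$$\frac{da_k}{dt} = \left\langle N\left(\sum a_j \psi_j, \sum a_j (\psi_j)_x, \sum a_j (\psi_j)_{xx}, \ldots, x, t; \beta\right), \psi_k \right\rangle \tag{12.10}$$

for $k = 1, 2, \ldots, n$.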
The given form of $N(\cdot)$ determines the mode coupling that occurs between the
various n modes. Indeed, the hallmark feature of nonlinearity is the production 
of modal mixing from (12.10). 

Numerical schemes based on the Galerkin projection (12.10) are commonly 
used to perform simulations of the full governing system (12.1). Convergence 
to the true solution can be accomplished by judicious choice of both the modal 
basis elements $\psi_k$ and the total number of modes $n$. Interestingly, the
separation-of-variables strategy, which is rooted in linear PDEs, works for non- 
linear and non-constant-coefficient PDEs, provided enough modal basis functions 
are chosen in order to accommodate all the nonlinear mode mixing that occurs 
in (12.10). A good choice of modal basis elements allows for a smaller set of n 
modes to be chosen to achieve a desired accuracy. The POD method is designed 
to specifically address the data-driven selection of a set of basis modes that are 
tailored to the particular dynamics, geometry, and parameters. 


Fourier Mode Expansion 


The most widely used basis for the Galerkin projection technique is Fourier
modes. More precisely, the fast Fourier transform (FFT) and its variants have
dominated scientific computing applied to the engineering, physical, and
biological sciences. There are two primary reasons for this: (1) there is a
strong intuition developed around the meaning of Fourier modes, as they directly
relate to spatial wavelengths and frequencies, and, more importantly, (2) the
algorithm necessary to compute the right-hand side of the Galerkin-projected
dynamics can be executed in $O(n \log n)$ operations. The second fact has made
the FFT one of the top 10 algorithms of the last century and a foundational
cornerstone of scientific computing.

The Fourier mode basis elements are given by

$$\psi_k(x) = \frac{1}{L} \exp\left( i \frac{2\pi k x}{L} \right) \tag{12.11}$$

for $x \in [0, L]$ and $k = -n/2, \ldots, -1, 0, 1, \ldots, n/2 - 1$.

It should be noted that in most software packages, including MATLAB, the FFT
command assumes that the spatial interval is $x \in [0, 2\pi]$. Thus one must
rescale a domain of length $L$ to $2\pi$ before using the FFT.

The Fourier modes (12.11) are complex periodic functions on the interval
$x \in [0, L]$. However, they are applicable to a much broader class of
functions that are not necessarily periodic. For instance, consider a localized
Gaussian function

$$u(x, t) = \exp(-\sigma x^2), \tag{12.12}$$

whose Fourier transform is also a Gaussian. In representing such a function
with Fourier modes, a large number of modes are often required since the
function itself is not periodic. Figure 12.1 shows the Fourier mode
representation of the Gaussian for three values of $\sigma$. Of note is the fact
that a large number of modes are required to represent this simple function,
especially as the Gaussian width is decreased. Although the FFT algorithm is
extremely fast and widely applied, one can see immediately that a large number
of modes are generically required to represent simple functions of interest.
Thus, solving problems using the FFT often requires high-dimensional
representations (i.e., $n \gg 1$) to accommodate generic, localized spatial
behaviors. Ultimately, our aim is to move away from artificially creating such
high-dimensional problems.

Figure 12.1: Illustration of Fourier modes for representing a localized Gaussian
pulse. (a) Here $n = 80$ Fourier modes are used to represent the Gaussian
$u(x) = \exp(-\sigma x^2)$ in the domain $x \in [-10, 10]$ for $\sigma = 0.1$ (red),
$\sigma = 1$ (black), and $\sigma = 10$ (blue). (b) The Fourier mode representation
of the Gaussian, showing the modes required for an accurate representation of
the localized function. (c) The convergence of the $n$-mode solution to the
actual Gaussian ($\sigma = 1$); with (d) the $L^2$ error from the true solution for
the three values of $\sigma$.
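
The dependence of the required number of Fourier modes on the Gaussian width can be sketched as follows. This is an illustrative experiment and not the code used for Figure 12.1: the grid size (256 points), the relative $L^2$ tolerance of $10^{-3}$, and the helper names `fourier_truncation_error` and `modes_needed` are all assumptions made here. The code keeps only the lowest wavenumbers of the FFT of $\exp(-\sigma x^2)$ on $x \in [-10, 10]$ and counts how many are needed to reach the tolerance.

```python
import numpy as np

def fourier_truncation_error(sigma, m, length=20.0, n=256):
    """Relative L2 error when only Fourier modes with |index| <= m are kept."""
    x = -length / 2 + (length / n) * np.arange(n)
    u = np.exp(-sigma * x**2)
    uhat = np.fft.fft(u)
    idx = np.fft.fftfreq(n, d=1.0 / n)          # integer mode indices
    uhat_trunc = np.where(np.abs(idx) <= m, uhat, 0)
    u_m = np.real(np.fft.ifft(uhat_trunc))
    return np.linalg.norm(u - u_m) / np.linalg.norm(u)

def modes_needed(sigma, tol=1e-3):
    """Smallest number of retained modes (2m + 1) with relative error below tol."""
    m = 0
    while fourier_truncation_error(sigma, m) > tol:
        m += 1
    return 2 * m + 1

for sigma in (0.1, 1.0, 10.0):
    print(f"sigma = {sigma:5.1f}: {modes_needed(sigma)} modes")
```

A larger $\sigma$ gives a narrower pulse in $x$ and therefore a broader spectrum, so the mode count grows as the Gaussian is localized.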


Special Functions and Sturm-Liouville Theory 


In the 1800s and early 1900s, mathematical physics developed many of the
governing principles behind heat flow, electromagnetism, and quantum mechanics,
for instance. Many of the hallmark problems considered were driven by
linear dynamics, allowing for analytically tractable solutions. And since these
problems arose before the advent of computing, nonlinearities were typically
treated as perturbations to an underlying linear equation. Thus one often
considered complex systems of the form

$$u_t = Lu + \epsilon N(u, u_x, u_{xx}, \ldots, x, t; \beta), \tag{12.13}$$

where $L$ is a linear operator and $\epsilon \ll 1$ is a small parameter used for
perturbation calculations. Often in mathematical physics, the operator $L$ is a
Sturm-Liouville operator, which guarantees many advantageous properties of the
eigenvalues and eigenfunctions.

To solve equations of the form in (12.13), special modes are often used that
are ideally suited for the problem. Such modes are eigenfunctions of the
underlying linear operator $L$ in (12.13):

$$L\psi_k = \lambda_k \psi_k, \tag{12.14}$$

where $\psi_k(x)$ are orthonormal eigenfunctions of the operator $L$. The
eigenfunctions allow for an eigenfunction expansion solution whereby
$u(x, t) = \sum a_k(t)\psi_k(x)$. This leads to the following solution form:

$$\frac{da_k}{dt} = \langle Lu, \psi_k \rangle + \epsilon \langle N, \psi_k \rangle. \tag{12.15}$$

The key idea in such an expansion is that the eigenfunctions presumably are
ideal for modeling the spatial variations particular to the problem under
consideration. Thus, they would seem to be ideal, or perfectly suited, modes for
(12.13). This is in contrast to the Fourier mode expansion, as the sinusoidal
modes may be unrelated to the particular physics or symmetries in the
geometry. For example, the Gaussian example considered can be potentially
represented more efficiently by Gauss-Hermite polynomials. Indeed, the wide
variety of special functions, including the Sturm-Liouville operators of Bessel,
Laguerre, Hermite, and Legendre, for instance, are aimed at making the
representation of solutions more efficient and much more closely related to the
underlying physics and geometry. Ultimately, one can think of using such
functions as a way of doing dimensionality reduction by using an ideally suited
set of basis functions.


Dimensionality Reduction 


The examples above and solution methods for PDEs illustrate a common problem
of scientific computing: the generation of high-dimensional systems with $n$
degrees of freedom. For many complex PDEs with several spatial dimensions, it is
not uncommon for discretization or modal expansion techniques to yield systems
of differential equations with millions or billions of degrees of freedom. Such
large systems are extremely demanding for even the latest computational
architectures, limiting accuracies and run-times in the modeling of many complex
systems, such as high-Reynolds-number fluid flows.

To aid in computation, the selection of a set of optimal basis modes is
critical, as it can greatly reduce the number of differential equations
generated. Many solution techniques involve the solution of a linear system of
size $n$, which generically involves $O(n^3)$ operations. Thus, reducing $n$ is
of paramount importance. One can already see that, even in the 1800s and early
1900s, the special functions developed for various problems of mathematical
physics were an analytic attempt to generate an ideal set of modes for
representing the dynamics of the complex system. However, for strongly
nonlinear, complex systems (12.1), even such special functions rarely give the
best set of modes. In the next section, we show how one might generate modes
$\psi_k$ that are tailored specifically for the dynamics and geometry in (12.1).
Based on the SVD algorithm, the proper orthogonal decomposition (POD) generates
a set of modes that are optimal for representing either simulation or
measurement data, potentially allowing for significant reduction of the number
of modes $n$ required to model the behavior of (12.1) for a given accuracy [738].


12.2 Optimal Basis Elements: the POD Expansion 


As illustrated in the previous section, the selection of a good modal basis for
solving (12.1) using the Galerkin expansion in (12.6) is critical for efficient
scientific computing strategies. Many algorithms for solving PDEs rely on
choosing basis modes a priori based on (i) computational speed, (ii) accuracy,
and/or (iii) constraints on boundary conditions. All these reasons are justified
and form the basis of computationally sound methods. However, our primary
concern in this chapter is in selecting a method that allows for maximal
computational efficiency via dimensionality reduction. As already highlighted,
many algorithms generate artificially large systems of size $n$. In what
follows, we present a data-driven strategy, whereby optimal modes, also known
as POD modes, are selected from numerical and/or experimental observations,
thus allowing for a minimal number of modes $r \ll n$ to characterize the
dynamics of (12.1).

Two options exist for extracting the optimal basis modes from a given complex
system. Either one can collect data directly from an experiment, or one can
simulate the complex system and sample the state of the system as it evolves
according to the dynamics. In both cases, snapshots of the dynamics are taken
and optimal modes identified. In the case when the system is simulated to
extract modes, one can argue that no computational savings are achieved.
However, much like the LU decomposition, which has an initial one-time
computational cost of $O(n^3)$ before further $O(n^2)$ operations can be applied,
the costly modal extraction process is performed only once. The optimal modes
can then be used in a computationally efficient manner thereafter.

To proceed with the construction of the optimal POD modes, the dynamics
of (12.1) are sampled at some prescribed time interval. In particular, a
snapshot $\mathbf{u}_k$ consists of samples of the complex system, with subscript
$k$ indicating sampling at time $t_k$, i.e.,
$\mathbf{u}_k := [u(x_1, t_k) \;\; u(x_2, t_k) \;\; \cdots \;\; u(x_n, t_k)]^T$. Now, the
continuous functions and modes will be evaluated at $n$ discrete spatial
locations, resulting in a high-dimensional vector representation; these will be
denoted by bold symbols. We are generally interested in analyzing the
computationally or experimentally generated large data set $\mathbf{X}$:

$$\mathbf{X} = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \end{bmatrix}, \tag{12.16}$$

where the columns $\mathbf{u}_k = \mathbf{u}(t_k) \in \mathbb{C}^n$ may be measurements from
simulations or experiments. Matrix $\mathbf{X}$ consists of a time series of data,
with $m$ distinct measurement instants in time. Often the state dimension $n$ is
very large, on the order of millions or billions in the case of fluid systems.
Typically $n \gg m$, resulting in a tall-skinny matrix, as opposed to a short-fat
matrix when $n \ll m$.

As discussed previously, the singular value decomposition (SVD) provides
a unique matrix decomposition for any complex-valued matrix
$\mathbf{X} \in \mathbb{C}^{n \times m}$:

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*, \tag{12.17}$$

where $\mathbf{U} \in \mathbb{C}^{n \times n}$ and $\mathbf{V} \in \mathbb{C}^{m \times m}$ are unitary
matrices, and $\boldsymbol{\Sigma} \in \mathbb{C}^{n \times m}$ is a matrix with non-negative
entries on the diagonal. Here $^*$ denotes the complex conjugate transpose. The
columns of $\mathbf{U}$ are called left singular vectors of $\mathbf{X}$ and the
columns of $\mathbf{V}$ are right singular vectors. The diagonal elements of
$\boldsymbol{\Sigma}$ are called singular values and they are ordered from largest to
smallest. The SVD provides critical insight into building an optimal basis set
tailored to the specific problem. In particular, the matrix $\mathbf{U}$ is
guaranteed to provide the best set of modes to approximate $\mathbf{X}$ in an
$\ell_2$ sense. Specifically, the columns of this matrix contain the orthogonal
modes necessary to form the ideal basis. The matrix $\mathbf{V}$ gives the time
history of each of the modal elements, and the diagonal matrix
$\boldsymbol{\Sigma}$ is the weighting of each mode relative to the others. Recall
that the modes are arranged with the most dominant first and the least dominant
last.

The total number of modes generated is typically determined by the number
of snapshots $m$ taken in constructing $\mathbf{X}$ (where normally $n \gg m$). Our
objective is to determine the minimal number of modes necessary to accurately
represent the dynamics of (12.1) with a Galerkin projection (12.6). Thus we
are interested in a rank-$r$ approximation to the true dynamics where typically
$r \ll m$. The quantity of interest is then the low-rank decomposition of the SVD
given by

$$\tilde{\mathbf{X}} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*, \tag{12.18}$$

where $\|\mathbf{X} - \tilde{\mathbf{X}}\| < \epsilon$ for a given small value of $\epsilon$. This
low-rank truncation allows us to construct the modes of interest
$\boldsymbol{\psi}_k$ from the columns of the truncated matrix $\tilde{\mathbf{U}}$. In
particular, the optimal basis modes are given by

$$\tilde{\mathbf{U}} = \boldsymbol{\Psi} = \begin{bmatrix} \boldsymbol{\psi}_1 & \boldsymbol{\psi}_2 & \cdots & \boldsymbol{\psi}_r \end{bmatrix}, \tag{12.19}$$

where the truncation preserves the $r$ most dominant modes used in (12.6). The
truncated $r$ modes $\{\boldsymbol{\psi}_1, \boldsymbol{\psi}_2, \ldots, \boldsymbol{\psi}_r\}$ are then
used as the low-rank, orthogonal basis to represent the dynamics of (12.1).
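
The steps (12.16) to (12.19) can be sketched in Python as follows. The data here are synthetic and purely illustrative: two hypothetical spatial patterns with oscillating amplitudes plus weak noise stand in for snapshots of (12.1), and the 99.99% energy threshold used to pick $r$ is an assumed truncation criterion, one of several in common use.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 400, 60                                   # n spatial points, m snapshots
x = np.linspace(-10, 10, n)
t = np.linspace(0, 4 * np.pi, m)

# Synthetic snapshot matrix X (n x m): column k is the state at time t_k.
X = (np.outer(np.exp(-x**2 / 2), np.cos(t))
     + 0.5 * np.outer(x * np.exp(-x**2 / 2), np.sin(2 * t)))
X = X + 1e-6 * rng.standard_normal((n, m))       # weak measurement noise

# Economy SVD, X = U diag(S) V^*, as in (12.17).
U, S, Vh = np.linalg.svd(X, full_matrices=False)

# Choose r as the smallest rank capturing 99.99% of the energy (sum of S^2).
energy = np.cumsum(S**2) / np.sum(S**2)
r = int(np.searchsorted(energy, 0.9999) + 1)

Psi = U[:, :r]                                   # POD modes, as in (12.19)
X_r = Psi @ np.diag(S[:r]) @ Vh[:r, :]           # rank-r approximation (12.18)
rel_err = np.linalg.norm(X - X_r) / np.linalg.norm(X)

print("rank r =", r, " relative error =", rel_err)
```

Using `full_matrices=False` avoids forming the full $n \times n$ matrix $\mathbf{U}$, which matters when $n \gg m$.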

The above snapshot-based method for extracting the low-rank, $r$-dimensional
subspace of dynamic evolution associated with (12.1) is a data-driven
computational architecture. Indeed, it provides an equation-free method, i.e.,
the governing equation (12.1) may actually be unknown. In the event that the
underlying dynamics are unknown, then the extraction of the low-rank space
allows one to build potential models in an $r$-dimensional subspace as opposed
to remaining in a high-dimensional space where $n \gg r$. These ideas will be
explored further in what follows. However, it suffices to highlight at this
juncture that an optimal basis representation does not require an underlying
knowledge of the complex system (12.1).


Galerkin Projection onto POD Modes 


It is possible to approximate the state $\mathbf{u}$ of the PDE using a Galerkin
expansion:

$$\mathbf{u}(t) \approx \boldsymbol{\Psi}\mathbf{a}(t), \tag{12.20}$$

where $\mathbf{a}(t) \in \mathbb{R}^r$ is the time-dependent coefficient vector and
$r \ll n$. Plugging this modal expansion into the governing equation (12.13)
(with $\epsilon$ absorbed into $\mathbf{N}$) and applying orthogonality (multiplying by
$\boldsymbol{\Psi}^T$) gives the dimensionally reduced evolution

$$\frac{d\mathbf{a}(t)}{dt} = \boldsymbol{\Psi}^T \mathbf{L} \boldsymbol{\Psi} \mathbf{a}(t) + \boldsymbol{\Psi}^T \mathbf{N}(\boldsymbol{\Psi}\mathbf{a}(t), \beta). \tag{12.21}$$

By solving this system of much smaller dimension, the solution of a
high-dimensional nonlinear dynamical system can be approximated. Of critical
importance is evaluating the nonlinear terms in an efficient way using the gappy
POD or discrete empirical interpolation method (DEIM) mathematical
architecture discussed in a later chapter. Otherwise, the evaluation of the
nonlinear terms still requires calculation of functions and inner products with
the original dimension $n$. In certain cases, such as the quadratic nonlinearity
of Navier-Stokes, the nonlinear terms can be computed once in an offline manner.
However, parameterized systems generally require repeated evaluation of the
nonlinear terms as the POD modes change with $\beta$.
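
A minimal sketch of the Galerkin-POD reduction (12.21) is given below. The model problem is an assumption chosen for illustration and does not appear in the text: a periodic reaction-diffusion equation $u_t = \nu u_{xx} - \beta u^3$, discretized with the centered difference (12.3b), so that $\mathbf{L}$ is the diffusion matrix and $\mathbf{N}(\mathbf{u}, \beta) = -\beta \mathbf{u}^3$. The grid, parameter values, Gaussian initial condition, and energy threshold are likewise hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Full-order model (assumed example): u_t = nu * u_xx - beta * u^3, periodic.
n, length, nu, beta = 128, 10.0, 0.1, 1.0
dx = length / n
x = -length / 2 + dx * np.arange(n)
I = np.eye(n)
L_full = nu * (np.roll(I, 1, axis=0) - 2 * I + np.roll(I, -1, axis=0)) / dx**2

def N(u, beta):
    """Nonlinear term of the assumed model."""
    return -beta * u**3

t_eval = np.linspace(0, 2, 41)
u0 = np.exp(-x**2)
full = solve_ivp(lambda t, u: L_full @ u + N(u, beta), (0, 2), u0,
                 t_eval=t_eval, rtol=1e-8, atol=1e-10)
X = full.y                                       # snapshot matrix, n x m

# POD modes from the SVD of the snapshots.
U, S, _ = np.linalg.svd(X, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)
r = int(np.searchsorted(energy, 1 - 1e-10) + 1)
Psi = U[:, :r]

# Reduced model (12.21): the linear part is projected once, offline.
L_r = Psi.T @ L_full @ Psi                       # r x r matrix

def rom_rhs(t, a):
    # The nonlinear term is still evaluated in the full n-dimensional space.
    return L_r @ a + Psi.T @ N(Psi @ a, beta)

rom = solve_ivp(rom_rhs, (0, 2), Psi.T @ u0,
                t_eval=t_eval, rtol=1e-8, atol=1e-10)
X_rom = Psi @ rom.y                              # lift back via (12.20)
rel_err = np.linalg.norm(X - X_rom) / np.linalg.norm(X)
print("r =", r, " relative ROM error =", rel_err)
```

Note that `rom_rhs` lifts the state to dimension $n$ on every call in order to evaluate the nonlinearity, which is exactly the cost that motivates the interpolation techniques mentioned above.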


Example: the Harmonic Oscillator 


To illustrate the POD method for selecting optimal basis elements, we will
consider a classic problem of mathematical physics: the quantum harmonic
oscillator. Although the ideal basis functions (Gauss-Hermite functions) for
this problem are already known, we would like to infer these special functions
in a purely data-driven way. In other words, can we deduce these special
functions from snapshots of the dynamics alone? The standard harmonic oscillator
arises in the study of spring-mass systems. In particular, one often assumes
that the restoring force $F$ of a spring is governed by the linear Hooke's law:

$$F(t) = -kx, \tag{12.22}$$

where $k$ is the spring constant and $x(t)$ represents the displacement of the
spring from its equilibrium position. Such a force gives rise to a potential
energy for the spring of the form $V = kx^2/2$.

In considering quantum mechanical systems, such a restoring force (with
$k = 1$ without loss of generality) and associated potential energy give rise to
the Schrödinger equation with a parabolic potential,

$$i u_t + \frac{1}{2} u_{xx} - \frac{x^2}{2} u = 0, \tag{12.23}$$

where the second term in the partial differential equation represents the
kinetic energy of a quantum particle while the last term is the parabolic
potential associated with the linear restoring force.

The solution for the quantum harmonic oscillator can be easily computed
in terms of special functions. In particular, by assuming a solution of the form

$$u(x, t) = a_k \psi_k(x) \exp\left[-i\left(k + \tfrac{1}{2}\right)t\right], \tag{12.24}$$

with $a_k$ determined from initial conditions, one finds the following boundary
value problem for the eigenmodes of the system:

$$\frac{d^2\psi_k}{dx^2} + (2k + 1 - x^2)\psi_k = 0, \tag{12.25}$$

with the boundary conditions $\psi_k \to 0$ as $x \to \pm\infty$. Normalized solutions
to this equation can be expressed in terms of Hermite polynomials, $H_k(x)$, or
the Gauss-Hermite functions,

$$\psi_k = \left(2^k k! \sqrt{\pi}\right)^{-1/2} \exp(-x^2/2) H_k(x) \tag{12.26a}$$
$$\phantom{\psi_k} = (-1)^k \left(2^k k! \sqrt{\pi}\right)^{-1/2} \exp(x^2/2) \frac{d^k}{dx^k} \exp(-x^2). \tag{12.26b}$$
The Gauss-Hermite functions are typically thought of as the optimal basis
functions for the harmonic oscillator, as they naturally represent the
underlying dynamics driven by the Schrödinger equation with parabolic potential.
Indeed, solutions of the complex system (12.23) can be represented as the sum

$$u(x, t) = \sum_{k=0}^{\infty} a_k \left(2^k k! \sqrt{\pi}\right)^{-1/2} \exp(-x^2/2) H_k(x) \exp\left[-i\left(k + \tfrac{1}{2}\right)t\right]. \tag{12.27}$$

Such a solution strategy is ubiquitous in mathematical physics, as is evidenced
by the large number of special functions, often of Sturm-Liouville form, for
different geometries and boundary conditions. These include Bessel functions,
Laguerre polynomials, Legendre polynomials, parabolic cylinder functions,
spherical harmonics, etc.
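
The Gauss-Hermite functions (12.26a) are straightforward to evaluate numerically. The sketch below is illustrative: it assumes a uniform grid on $[-15, 15]$, approximates integrals by simple sums, and uses NumPy's physicists' Hermite polynomials. The helper names `gauss_hermite` and `project`, and the Gaussian pulse $\exp(-0.2(x - x_0)^2)$ being expanded, are choices made here for the example. A pulse centered at $x_0 = 0$ excites only even modes, whereas an offset pulse excites both even and odd modes.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval    # physicists' Hermite H_k

def gauss_hermite(k, x):
    """Normalized Gauss-Hermite function psi_k(x) from (12.26a)."""
    c = np.zeros(k + 1)
    c[k] = 1.0                                   # select H_k in the Hermite series
    norm = 1.0 / sqrt(2.0**k * factorial(k) * sqrt(pi))
    return norm * np.exp(-x**2 / 2) * hermval(x, c)

x = np.linspace(-15, 15, 3001)
dx = x[1] - x[0]

def project(x0, K=12):
    """Coefficients a_k = <u0, psi_k>, k = 0..K-1, for a Gaussian centered at x0."""
    u0 = np.exp(-0.2 * (x - x0)**2)
    return np.array([np.sum(u0 * gauss_hermite(k, x)) * dx for k in range(K)])

a_sym = project(0.0)
a_off = project(1.0)
print("symmetric pulse:", np.round(a_sym, 4))
print("offset pulse:   ", np.round(a_off, 4))
```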

A numerical solution to the governing PDE (12.23) based on the fast Fourier
transform is easy to implement [420]. The following code executes a full
numerical solution with the initial conditions
$u(x, 0) = \exp(-0.2(x - x_0)^2)$, which is a Gaussian pulse centered at
$x = x_0$. This initial condition generically excites a number of Gauss-Hermite
functions. In particular, the initial projection onto the eigenmodes is computed
from the orthogonality conditions so that

$$a_k = \langle u(x, 0), \psi_k \rangle. \tag{12.28}$$

This inner product projects the initial condition onto each mode $\psi_k$.


Code 12.1: [MATLAB] Harmonic oscillator code. 


L=30; n=512; x2=linspace(-L/2,L/2,n+1); x=x2(1:n); % spatial discretization
k=(2*pi/L)*[0:n/2-1 -n/2:-1].';                    % wavenumbers for FFT
V=x.^2.';                                          % potential
t=0:0.2:20;                                        % time domain collection points
u=exp(-0.2*(x-1).^2);                              % initial conditions (x0 = 1)
ut=fft(u);                                         % FFT initial data
[t,utsol]=ode45('pod_harm_rhs',t,ut,[],k,V);       % integrate PDE
for j=1:length(t)
    usol(j,:)=ifft(utsol(j,:));                    % transforming back
end

Code 12.1: [Python] Harmonic oscillator code. 


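import numpy as np
from scipy import integrate

L = 30; n = 512
x = np.linspace(-L/2, L/2, n+1)[:n]               # spatial discretization
k = (2*np.pi/L)*np.concatenate((np.arange(0, n/2), np.arange(-n/2, 0)))  # wavenumbers for FFT
V = np.power(x, 2)                                # potential
t = np.linspace(0, 20, 101)                       # time domain collection points
# harm_rhs (Code 12.2) must be defined at this point, after k, V, and n exist

u = np.exp(-0.2*np.power(x-1, 2))                 # initial conditions (x0 = 1)
ut = np.fft.fft(u)                                # FFT initial data
ut_split = np.concatenate((np.real(ut), np.imag(ut)))  # odeint needs real-valued state
utsol_split = integrate.odeint(harm_rhs, ut_split, t, mxstep=10**6)
utsol = utsol_split[:, :n] + (1j)*utsol_split[:, n:]

usol = np.zeros_like(utsol)
for jj in range(len(t)):
    usol[jj, :] = np.fft.ifft(utsol[jj, :])       # transforming back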


The right-hand side function pod_harm_rhs.m associated with the above 
code contains the governing equation (12.23) in a three-line MATLAB code: 


Code 12.2: [MATLAB] Harmonic oscillator right-hand side. 


function rhs=pod_harm_rhs(t,ut,dummy,k,V)
u=ifft(ut);
rhs=-(1i/2)*(k.^2).*ut - 0.5*1i*fft(V.*u);


Code 12.2: [Python] Harmonic oscillator right-hand side. 


def harm_rhs(ut_split, t, k=k, V=V, n=n):
    ut = ut_split[:n] + (1j)*ut_split[n:]         # rebuild complex Fourier state
    u = np.fft.ifft(ut)
    rhs = -0.5*(1j)*np.power(k, 2)*ut - 0.5*(1j)*np.fft.fft(V*u)
    rhs_split = np.concatenate((np.real(rhs), np.imag(rhs)))
    return rhs_split


The two codes together produce dynamics associated with the quantum
harmonic oscillator. Figure 12.2 shows the dynamical evolution of an initial
Gaussian $u(x, 0) = \exp(-0.2(x - x_0)^2)$ with $x_0 = 0$ (top left) and $x_0 = 1$
(top right). From the simulation, one can see that there are a total of 101
snapshots (the initial condition and an additional 100 measurement times). These
snapshots can be organized as in (12.16) and the singular value decomposition
performed. The singular values of the decomposition are suggestive of the
underlying dimensionality of the dynamics. For the dynamical evolution observed
in the top panels of Fig. 12.2, the corresponding singular values of the
snapshots are given in the bottom panels. For the symmetric initial condition
(symmetric about $x = 0$), five modes dominate the dynamics. In contrast, for an
asymmetric initial condition, twice as many modes are required to represent the
dynamics with the same precision.
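
The snapshot experiment described above can be sketched end to end as follows. This is a self-contained illustration, not the listing from Codes 12.1 and 12.2: it assumes a coarser grid ($n = 128$ instead of 512, to keep the run short), integrates the complex Fourier state directly with SciPy's `solve_ivp`, and uses a 99.9% energy threshold to count modes. The helper names `snapshots` and `modes_for_energy` are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

L, n = 30.0, 128                                 # coarser than the text's n = 512
x = np.linspace(-L / 2, L / 2, n + 1)[:n]
k = (2 * np.pi / L) * np.fft.fftfreq(n, d=1.0 / n)   # FFT-ordered wavenumbers
V = x**2                                         # parabolic potential
t = np.linspace(0, 20, 101)                      # 101 snapshot times

def rhs(_, ut):
    """Right-hand side of (12.23) in Fourier space."""
    u = np.fft.ifft(ut)
    return -0.5j * k**2 * ut - 0.5j * np.fft.fft(V * u)

def snapshots(x0):
    """Snapshot matrix (n x 101) for the initial pulse exp(-0.2 (x - x0)^2)."""
    ut0 = np.fft.fft(np.exp(-0.2 * (x - x0)**2))
    sol = solve_ivp(rhs, (t[0], t[-1]), ut0, t_eval=t, rtol=1e-8, atol=1e-10)
    return np.fft.ifft(sol.y, axis=0)            # columns are u(x, t_j)

def modes_for_energy(X, frac=0.999):
    """Smallest rank whose singular values capture the fraction frac of energy."""
    s = np.linalg.svd(X, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, frac) + 1)

X_sym, X_off = snapshots(0.0), snapshots(1.0)
r_sym, r_off = modes_for_energy(X_sym), modes_for_energy(X_off)
print("modes for 99.9% energy: symmetric =", r_sym, ", offset =", r_off)
```

The exact mode counts depend on the chosen energy threshold, so they need not coincide with the counts read off Figure 12.2; the comparison between the symmetric and offset initial conditions is the point of the sketch.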

The singular value decomposition not only gives the distribution of energy 
within the first set of modes, but it also produces the optimal basis elements 
as columns of the matrix $\mathbf{U}$. The distribution of singular values is highly
suggestive of how to truncate with a low-rank subspace of $r$ modes, thus
allowing us to construct the dimensionally reduced space appropriate for a
Galerkin-POD expansion.

Figure 12.2: Dynamics of the quantum harmonic oscillator (12.23) given the
initial condition $u(x, 0) = \exp(-0.2(x - x_0)^2)$ for $x_0 = 0$ (top left) and
$x_0 = 1$ (top right). The symmetric initial data elicits a dominant five-mode
response while the initial condition with initial offset $x_0 = 1$ activates 10
modes. The bottom panels show the singular values of the SVD of the
corresponding top panels, along with the percentage of energy (or $L^2$-norm) in
each mode. The dynamics are clearly low-rank given the rapid decay of the
singular values.

The modes of the quantum harmonic oscillator are illustrated in Fig. 12.3.
Specifically, the first five modes are shown for (i) the Gauss-Hermite functions
representing the special function solutions, (ii) the modes of the SVD for the
symmetric ($x_0 = 0$) initial conditions, and (iii) the modes of the SVD for the
offset (asymmetric, $x_0 = 1$) initial conditions. The Gauss-Hermite functions,
by construction, are arranged starting from the lowest eigenvalue of the
Sturm-Liouville problem (12.25). The eigenmodes alternate between symmetric and
asymmetric modes. For the symmetric (about $x = 0$) initial conditions given by
$u(x, 0) = \exp(-0.2x^2)$, the first five modes are all symmetric, as the
snapshot-based method is incapable of producing asymmetric modes since they are
actually not part of the dynamics, and thus they are not observable or
manifested in the evolution. In contrast, with a slight offset,
$u(x, 0) = \exp(-0.2(x - 1)^2)$, snapshots of the evolution produce asymmetric
modes that closely resemble the asymmetric modes of the Gauss-Hermite
expansion. Interestingly, in this case, the SVD arranges the modes by the amount
of energy exhibited in each mode. Thus the first asymmetric mode (bottom panel
in red, third mode) is equivalent to the second mode of the exact Gauss-Hermite
polynomials (top panel in green, second mode). The key observation here is that
the snapshot-based method is capable of generating, or nearly so, the known
optimal Gauss-Hermite polynomials characteristic of this system. Importantly,
the Galerkin-POD method
generalizes to more complex physics and geometries where the solution is not
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Figure 12.3: First five modes of the quantum harmonic oscillator. In the top 
panel, the first five Gauss-Hermite modes (12.26), arranged by their Sturm- 
Liouville eigenvalue, are illustrated. The second panel shows the dominant 
modes computed from the SVD of the dynamics of the harmonic oscillator 
with u(z,0) = exp(—0.22°), illustrated in Fig. (left). Note that the modes 
are all symmetric, since no asymmetric dynamics was actually manifested. For 
the bottom panel, where the harmonic oscillator was simulated with the offset 
Gaussian u(x,0) = exp(—0.2(x—1)*), asymmetry is certainly observed. This also 
produces modes that are very similar to the Gauss—Hermite functions. Thus a 
purely snapshot-based method is capable of reproducing the nearly ideal basis 
set for the harmonic oscillator. 


known a priori. 


12.3 POD and Soliton Dynamics 


To illustrate a full implementation of the Galerkin—POD method, we will con- 
sider an illustrative complex system whose dynamics are strongly nonlinear. 
Thus, we consider the nonlinear Schrödinger (NLS) equation, 


$$ i u_t + \frac{1}{2} u_{xx} + |u|^2 u = 0, \tag{12.29} $$


with the boundary conditions $u \to 0$ as $x \to \pm\infty$. If not for the
nonlinear term, this equation could be solved easily in closed form. However,
the nonlinearity mixes the eigenfunction components in the expansion (12.6),
and it is impossible to derive a simple analytic solution.

To solve the NLS computationally, a Fourier mode expansion is used, so that
the standard fast Fourier transform may be leveraged. Rewriting (12.29) in the
Fourier domain, i.e., taking the Fourier transform, gives the set of
differential equations

$$ \hat{u}_t = -\frac{i}{2} k^2 \hat{u} + i\,\widehat{|u|^2 u}, \tag{12.30} $$

where the Fourier mode mixing occurs due to the cubic nonlinearity. This gives
the system of differential equations to be solved in order to evaluate the NLS
behavior.
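
The Fourier-space system can be time-stepped directly. The following is a minimal sketch, not the book's listing: it assumes a periodic domain of length `L = 30` with `n = 128` points and swaps in a hand-written fourth-order Runge-Kutta step (`rk4_step`, a hypothetical helper) in place of a library ODE solver. The initial condition is sech(x), for which the solution is expected to keep its amplitude profile while its phase rotates.

```python
import numpy as np

L, n = 30.0, 128
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)    # Fourier wavenumbers

def nls_rhs(ut):
    """Time derivative of the Fourier coefficients ut of the NLS solution."""
    u = np.fft.ifft(ut)
    return -0.5j * k**2 * ut + 1j * np.fft.fft(np.abs(u)**2 * u)

def rk4_step(ut, dt):
    """One classical Runge-Kutta step (assumed integrator for this sketch)."""
    k1 = nls_rhs(ut)
    k2 = nls_rhs(ut + 0.5 * dt * k1)
    k3 = nls_rhs(ut + 0.5 * dt * k2)
    k4 = nls_rhs(ut + dt * k3)
    return ut + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def evolve(u0, t_final, dt=0.005):
    """Advance the initial condition u0 to time t_final; returns u(x, t_final)."""
    ut = np.fft.fft(u0)
    for _ in range(int(round(t_final / dt))):
        ut = rk4_step(ut, dt)
    return np.fft.ifft(ut)

u1 = evolve(1.0 / np.cosh(x), t_final=1.0)
```

The time step is kept small because the explicit integrator must resolve the fastest linear frequency $k^2/2$ on the grid; an integrating-factor or implicit scheme would relax that restriction.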

It now remains to consider a specific spatial configuration for the initial
condition. For the NLS, there is a set of special initial conditions, called
solitons, given by

$$ u(x,0) = N\,\mathrm{sech}(x), \tag{12.31} $$

where N is an integer. We will consider the soliton dynamics with N = 1 and
N = 2. First, the initial condition is projected onto the Fourier modes with the
fast Fourier transform.

The dynamics of the N = 1 and N = 2 solitons are demonstrated in Fig. 12.4.
During evolution, the N = 1 soliton only undergoes phase changes while its 
amplitude remains stationary. In contrast, the N = 2 soliton undergoes periodic 
oscillations. In both cases, a large number of Fourier modes, about 50 and 200, 
respectively, are required to model the simple behaviors illustrated. 

The obvious question to ask in light of our dimensionality reduction
thinking is this: Is the soliton dynamics really a 50- or 200-degrees-of-freedom
system as required by the Fourier mode solution technique? The answer is no.
With the appropriate basis, i.e., the POD modes generated from the SVD, the
dynamics reduce to one or two modes, respectively. Indeed, the N = 1 and N = 2
solitons are truly low-dimensional, as shown by the evolutions in Fig. 12.4.

Figure 12.5 explicitly demonstrates the low-dimensional nature of the
numerical solutions by computing the singular values, along with the modes to
be used in our new eigenfunction expansion. For both of these cases, the
dynamics are truly low-dimensional, with the N = 1 soliton being modeled well
by a single POD mode while the N = 2 dynamics are modeled quite well with
two POD modes. Thus, in performing an eigenfunction expansion, the modes
chosen should be the POD modes generated from the simulations themselves.
In the next section, we will derive the dynamics of the modal interaction for
these two cases, which are low-dimensional and amenable to analysis.

Figure 12.4: Evolution of the (a) N = 1 and (b) N = 2 solitons. Here steady-state
(N = 1, panels (a) and (c)) and periodic (N = 2, panels (b) and (d)) dynamics
are observed and approximately 50 and 200 Fourier modes, respectively, are
required to model the behaviors.


Soliton Reduction (N = 1) 


To take advantage of the low-dimensional structure, we first consider the N = 1
soliton dynamics. Figure 12.5 shows that a single mode in the SVD dominates
the dynamics. This is the first column of the U matrix. Thus the dynamics are
recast in a single mode so that

$$ u(x,t) = a(t)\psi(x). \tag{12.32} $$

Plugging this into the NLS equation (12.29) yields the following:

$$ i a_t \psi + \frac{1}{2} a \psi_{xx} + |a|^2 a |\psi|^2 \psi = 0. \tag{12.33} $$

The inner product is now taken with respect to $\psi$, which gives

$$ i a_t + \frac{\alpha}{2} a + \beta |a|^2 a = 0, \tag{12.34} $$

where

$$ \alpha = \frac{\langle \psi_{xx}, \psi \rangle}{\langle \psi, \psi \rangle}, \tag{12.35a} $$

$$ \beta = \frac{\langle |\psi|^2 \psi, \psi \rangle}{\langle \psi, \psi \rangle}. \tag{12.35b} $$

Figure 12.5: Projection of the N = 1 and N = 2 evolutions onto their POD
modes. Panels (a) and (b) are the singular values $\sigma_j$ on a logarithmic
scale of the two evolutions demonstrated in Fig. 12.4. This demonstrates that
the N = 1 and N = 2 soliton dynamics are primarily low-rank, with the N = 1
being a single-mode evolution and the N = 2 being dominated by two modes that
contain approximately 95% of the evolution variance. The first three modes in
both cases are shown in panels (c) and (d).


This is the low-rank approximation achieved by the Galerkin-POD method.
The differential equation (12.34) for a(t) can be solved explicitly to yield

$$ a(t) = a(0) \exp\left( i\frac{\alpha}{2} t + i\beta |a(0)|^2 t \right), \tag{12.36} $$

where a(0) is the initial condition for a(t). To find the initial condition,
recall that

$$ u(x,0) = \mathrm{sech}(x) = a(0)\psi(x). \tag{12.37} $$

Taking the inner product with respect to $\psi(x)$ gives

$$ a(0) = \frac{\langle \mathrm{sech}(x), \psi \rangle}{\langle \psi, \psi \rangle}. \tag{12.38} $$

Thus the one-mode expansion gives the approximate PDE solution

$$ u(x,t) = a(0) \exp\left( i\frac{\alpha}{2} t + i\beta |a(0)|^2 t \right) \psi(x). \tag{12.39} $$

This solution is the low-dimensional POD approximation of the PDE expanded
in the best basis possible, i.e., the SVD basis.

For the N = 1 soliton, the spatial profile remains constant while its phase
undergoes a nonlinear rotation. The POD solution (12.39) can be solved exactly
to characterize this phase rotation.
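
The one-mode reduction can be checked numerically. The sketch below is built on assumptions that are not part of the original text: snapshots are taken from the known exact solution $\mathrm{sech}(x)\,e^{it/2}$ of this form of the NLS rather than from a simulation, inner products are approximated by a plain Riemann sum, and $\psi_{xx}$ is computed spectrally on a periodic grid. The names `inner`, `rate`, and `u_pod` are illustrative.

```python
import numpy as np

L, n = 30.0, 256
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
t = np.linspace(0, 2 * np.pi, 41)

# Assumption: snapshots sampled from the exact N = 1 solution sech(x) exp(i t / 2)
X = np.outer(1 / np.cosh(x), np.exp(0.5j * t))

U, S, Vh = np.linalg.svd(X, full_matrices=False)
psi = U[:, 0]                                    # dominant POD mode

def inner(f, g):
    """Discrete version of <f, g> = int f g^* dx."""
    return np.sum(f * np.conj(g)) * dx

psi_xx = np.fft.ifft(-(k**2) * np.fft.fft(psi))  # spectral second derivative
alpha = (inner(psi_xx, psi) / inner(psi, psi)).real
beta = (inner(np.abs(psi)**2 * psi, psi) / inner(psi, psi)).real
a0 = inner(1 / np.cosh(x), psi) / inner(psi, psi)

# Phase rotation rate of the one-mode amplitude a(t)
rate = alpha / 2 + beta * abs(a0)**2
# One-mode POD approximation of u(x, t) at the final snapshot time
u_pod = a0 * np.exp(1j * rate * t[-1]) * psi
```

For these data the rotation rate comes out at 1/2, matching the phase of the exact soliton, and the result does not depend on the arbitrary scaling or phase that the SVD assigns to `psi`.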


Soliton Reduction (N = 2) 


The N = 2 soliton case is a bit more complicated and interesting. In this case,
two modes clearly dominate the behavior of the system, as they contain 96% of
the energy. These two modes, $\psi_1$ and $\psi_2$, are the first two columns of
the matrix U and are now used to approximate the dynamics observed in Fig. 12.4.
In this case, the two-mode expansion takes the form

$$ u(x,t) = a_1(t)\psi_1(x) + a_2(t)\psi_2(x). \tag{12.40} $$
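Inserting this approximation into the governing equation (12.29) gives

$$ i\left(a_{1t}\psi_1 + a_{2t}\psi_2\right) + \frac{1}{2}\left(a_1\psi_{1xx} + a_2\psi_{2xx}\right)
 + \left(a_1\psi_1 + a_2\psi_2\right)^2 \left(a_1^*\psi_1^* + a_2^*\psi_2^*\right) = 0. \tag{12.41} $$

Multiplying out the cubic term gives

$$ \begin{aligned}
 & i\left(a_{1t}\psi_1 + a_{2t}\psi_2\right) + \frac{1}{2}\left(a_1\psi_{1xx} + a_2\psi_{2xx}\right) \\
 & \quad + |a_1|^2 a_1 |\psi_1|^2\psi_1 + |a_2|^2 a_2 |\psi_2|^2\psi_2 + 2|a_1|^2 a_2 |\psi_1|^2\psi_2 \\
 & \quad + 2|a_2|^2 a_1 |\psi_2|^2\psi_1 + a_1^2 a_2^* \psi_1^2\psi_2^* + a_2^2 a_1^* \psi_2^2\psi_1^* = 0.
\end{aligned} \tag{12.42} $$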


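All that remains is to take the inner product of this equation with respect to
both $\psi_1(x)$ and $\psi_2(x)$. Recall that these two modes are orthogonal,
resulting in the following 2 x 2 system of nonlinear equations:

$$ \begin{aligned}
 & i a_{1t} + \alpha_{11}a_1 + \alpha_{12}a_2 + \left(\beta_{111}|a_1|^2 + 2\beta_{211}|a_2|^2\right)a_1 \\
 & \quad + \left(2\beta_{121}|a_1|^2 + \beta_{221}|a_2|^2\right)a_2 + \sigma_{121}a_1^2 a_2^* + \sigma_{211}a_2^2 a_1^* = 0,
\end{aligned} \tag{12.43a} $$

$$ \begin{aligned}
 & i a_{2t} + \alpha_{21}a_1 + \alpha_{22}a_2 + \left(\beta_{112}|a_1|^2 + 2\beta_{212}|a_2|^2\right)a_1 \\
 & \quad + \left(2\beta_{122}|a_1|^2 + \beta_{222}|a_2|^2\right)a_2 + \sigma_{122}a_1^2 a_2^* + \sigma_{212}a_2^2 a_1^* = 0,
\end{aligned} \tag{12.43b} $$

where

$$ \alpha_{jk} = \langle \psi_{j,xx}, \psi_k \rangle / 2, \tag{12.44a} $$

$$ \beta_{jkl} = \langle |\psi_j|^2 \psi_k, \psi_l \rangle, \tag{12.44b} $$

$$ \sigma_{jkl} = \langle \psi_j^2 \psi_k^*, \psi_l \rangle, \tag{12.44c} $$

and the initial values of the two components are given by

$$ a_1(0) = \frac{\langle 2\,\mathrm{sech}(x), \psi_1 \rangle}{\langle \psi_1, \psi_1 \rangle}, \tag{12.45a} $$

$$ a_2(0) = \frac{\langle 2\,\mathrm{sech}(x), \psi_2 \rangle}{\langle \psi_2, \psi_2 \rangle}. \tag{12.45b} $$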


This gives a complete description of the two-mode dynamics predicted from 
the SVD analysis. 
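
The bookkeeping of the two-mode projection is easy to get wrong, so a numerical consistency check is useful. The sketch below is an illustration under stated assumptions: the two modes are hypothetical real, orthonormal stand-ins (sech and sech tanh) rather than columns of U from a simulation, inner products use a Riemann sum, and the coefficient arrays `alpha`, `beta`, `sigma` follow the inner-product definitions $\langle \psi_{j,xx}, \psi_l\rangle/2$, $\langle |\psi_j|^2\psi_k, \psi_l\rangle$, and $\langle \psi_j^2\psi_k^*, \psi_l\rangle$ with zero-based indices. The coefficient form of the right-hand side is compared with a direct projection of the NLS terms.

```python
import numpy as np

L, n = 40.0, 512
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)

def inner(f, g):
    """Discrete version of <f, g> = int f g^* dx."""
    return np.sum(f * np.conj(g)) * dx

def d2(f):
    """Spectral second derivative on the periodic grid."""
    return np.fft.ifft(-(k**2) * np.fft.fft(f))

# Hypothetical orthonormal modes standing in for the first two columns of U
p1 = 1 / np.cosh(x)
p2 = np.tanh(x) / np.cosh(x)
psi = [p1 / np.sqrt(inner(p1, p1).real), p2 / np.sqrt(inner(p2, p2).real)]

# alpha[j, l], beta[j, m, l], sigma[j, m, l]; for real modes alpha is symmetric
alpha = np.array([[inner(d2(psi[j]), psi[l]) / 2 for l in range(2)]
                  for j in range(2)])
beta = np.array([[[inner(np.abs(psi[j])**2 * psi[m], psi[l]) for l in range(2)]
                  for m in range(2)] for j in range(2)])
sigma = np.array([[[inner(psi[j]**2 * np.conj(psi[m]), psi[l]) for l in range(2)]
                   for m in range(2)] for j in range(2)])

def rhs_coeff(a):
    """da/dt assembled from the precomputed projection coefficients."""
    a1, a2 = a
    out = np.zeros(2, dtype=complex)
    for l in range(2):
        lin = alpha[0, l] * a1 + alpha[1, l] * a2
        cub = (beta[0, 0, l] * abs(a1)**2 * a1
               + beta[1, 1, l] * abs(a2)**2 * a2
               + 2 * beta[0, 1, l] * abs(a1)**2 * a2
               + 2 * beta[1, 0, l] * abs(a2)**2 * a1
               + sigma[0, 1, l] * a1**2 * np.conj(a2)
               + sigma[1, 0, l] * a2**2 * np.conj(a1))
        out[l] = 1j * (lin + cub)
    return out

def rhs_direct(a):
    """Same right-hand side from projecting (1/2) u_xx + |u|^2 u directly."""
    u = a[0] * psi[0] + a[1] * psi[1]
    full = 0.5 * d2(u) + np.abs(u)**2 * u
    return np.array([1j * inner(full, psi[l]) for l in range(2)])

a_test = np.array([0.7 + 0.2j, -0.3 + 0.5j])
```

The coefficient form is what one would hand to an ODE solver, since the inner products are computed once; the direct form is only used here as a cross-check.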

The two-mode dynamics accurately approximates the solution. However, 
there is a phase drift that occurs in the dynamics that would require both higher 
precision in the time series of the full PDE and more accurate integration of the 
inner products for the coefficients. Indeed, the simplest trapezoidal rule
has been used to compute the inner products, and its accuracy is somewhat 
suspect; this issue will be addressed in the following section. Higher-order 
schemes could certainly help improve the accuracy. Additionally, incorporat- 
ing the third or higher modes could also help. In either case, this demonstrates 
how one would use the low-dimensional structures to approximate PDE dy- 
namics in practice. 


12.4 Continuous Formulation of POD 


Thus far, the POD reduction has been constructed to accommodate discrete
data measurement snapshots X as given by (12.16). The POD reduction generates
a set of low-rank basis modes $\Psi$ so that the following least-squares error
is minimized:

$$ \underset{\Psi \ \mathrm{s.t.}\ \mathrm{rank}(\Psi) = r}{\mathrm{argmin}} \; \| X - \Psi\Psi^T X \|_F. \tag{12.46} $$

Recall that $X \in \mathbb{C}^{n\times m}$ and $\Psi \in \mathbb{C}^{n\times r}$, where r is the rank of the truncation.
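
As a quick numerical illustration of this least-squares statement, the sketch below projects a data matrix onto its first r left singular vectors and compares the Frobenius-norm error with the energy in the discarded singular values. The random complex matrix is hypothetical test data, and the conjugate transpose is used in place of the plain transpose because the data are complex.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 60, 40, 5
X = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

U, S, Vh = np.linalg.svd(X, full_matrices=False)
Psi = U[:, :r]                                    # rank-r POD basis

# Frobenius-norm error of the projection X - Psi Psi^* X
err = np.linalg.norm(X - Psi @ (Psi.conj().T @ X), 'fro')
# Energy contained in the discarded singular values
tail = np.sqrt(np.sum(S[r:]**2))

# A random orthonormal basis of the same rank, for comparison
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
err_random = np.linalg.norm(X - Q @ (Q.T @ X), 'fro')
```

The two numbers `err` and `tail` agree to round-off, and any other rank-r orthonormal basis, such as `Q`, gives a larger error.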

In many cases, measurements are performed on a continuous-time process
over a prescribed spatial domain; thus the data we consider are constructed
from trajectories

$$ u(x,t) \quad \text{with} \quad t \in [0,T], \; x \in [-L, L]. \tag{12.47} $$

Such data require a continuous-time formulation of the POD reduction. In
particular, an equivalent of (12.46) must be constructed for these
continuous-time trajectories. Note that, instead of a spatially dependent
function u(x,t), one can also consider a vector of trajectories
$\mathbf{u}(t) \in \mathbb{C}^n$. This may arise when a PDE is discretized so
that the infinite-dimensional spatial variable x is finite-dimensional.
Volkwein gives an excellent, technical overview of the POD method and its
continuous formulation.

To define the continuous formulation, we prescribe the inner product

$$ \langle f(x), g(x) \rangle = \int_{-L}^{L} f(x) g^*(x)\,dx. \tag{12.48} $$

To find the best-fit function through the entire temporal trajectory u(x,t) in
(12.47), the following minimization problem must be solved:

$$ \min_{\psi} \frac{1}{T}\int_0^T \left\| u(x,t) - \langle u(x,t), \psi(x) \rangle \psi \right\|^2 dt \quad \text{subject to} \quad \|\psi\|^2 = 1, \tag{12.49} $$

where the normalization of the temporal integral by 1/T averages the
difference between the data and its low-rank approximation using the function
$\psi$ over the time $t \in [0,T]$. Equation (12.49) is equivalent to maximizing
the inner product between the data u(x,t) and the function $\psi(x)$, i.e.,
they are maximally parallel in function space. Thus the minimization problem
can be restated as

$$ \max_{\psi} \frac{1}{T}\int_0^T \left| \langle u(x,t), \psi(x) \rangle \right|^2 dt \quad \text{subject to} \quad \|\psi\|^2 = 1. \tag{12.50} $$


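The constrained optimization problem in (12.50) can be reformulated as a
Lagrangian functional,

$$ \mathcal{L}(\psi, \lambda) = \frac{1}{T}\int_0^T \left| \langle u(x,t), \psi(x) \rangle \right|^2 dt + \lambda\left(1 - \|\psi\|^2\right), \tag{12.51} $$

where $\lambda$ is the Lagrange multiplier that enforces the constraint
$\|\psi\|^2 = 1$. This can be rewritten as

$$ \begin{aligned}
\mathcal{L}(\psi, \lambda) = {} & \frac{1}{T}\int_0^T \left( \int_{-L}^{L} u(\xi,t)\psi^*(\xi)\,d\xi \int_{-L}^{L} u^*(x,t)\psi(x)\,dx \right) dt \\
 & + \lambda\left( 1 - \int_{-L}^{L} \psi(x)\psi^*(x)\,dx \right).
\end{aligned} \tag{12.52} $$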

The Lagrange multiplier problem requires that the functional derivative be
zero:

$$ \frac{\partial \mathcal{L}}{\partial \psi^*} = 0. \tag{12.53} $$

Applying this derivative constraint to (12.52) and interchanging integrals yields

$$ \frac{\partial \mathcal{L}}{\partial \psi^*} = \int_{-L}^{L} d\xi \left[ \frac{1}{T}\int_0^T \left( u(\xi,t) \int_{-L}^{L} u^*(x,t)\psi(x)\,dx \right) dt - \lambda\psi(\xi) \right] = 0. \tag{12.54} $$

Setting the integrand to zero, the following eigenvalue problem is derived:

$$ \langle R(\xi, x), \psi \rangle = \lambda\psi, \tag{12.55} $$

where $R(\xi, x)$ is a two-point correlation tensor of the continuous data
u(x,t), which is averaged over the time interval where the data is sampled:

$$ R(\xi, x) = \frac{1}{T}\int_0^T u(\xi,t)\,u^*(x,t)\,dt. \tag{12.56} $$

If the spatial direction x is discretized, resulting in a high-dimensional
vector $\mathbf{u}(t) = [u(x_1,t) \;\; u(x_2,t) \;\; \cdots \;\; u(x_n,t)]^T$,
then $R(\xi, x)$ becomes

$$ \mathbf{R} = \frac{1}{T}\int_0^T \mathbf{u}(t)\mathbf{u}^*(t)\,dt. \tag{12.57} $$

In practice, the function R is evaluated using a quadrature rule for
integration. This will allow us to connect the method to the snapshot-based
method discussed thus far.


Quadrature Rules for R: Trapezoidal Rule 


The evaluation of the integral (12.57) can be performed by numerical
quadrature [420]. The simplest quadrature rule is the trapezoidal rule, which
evaluates the integral via summation of approximating rectangles. Figure 12.6
illustrates a version of the trapezoidal rule where the integral is
approximated by a summation over a number of rectangles. This gives the
approximation of the two-point correlation tensor:

$$ \mathbf{R} = \frac{1}{T}\int_0^T \mathbf{u}(t)\mathbf{u}^*(t)\,dt
 \approx \frac{\Delta t}{T}\left[ \mathbf{u}_1\mathbf{u}_1^* + \mathbf{u}_2\mathbf{u}_2^* + \cdots + \mathbf{u}_m\mathbf{u}_m^* \right]. \tag{12.58} $$

Here we have assumed u(x,t) is discretized into a vector
$\mathbf{u}_j = \mathbf{u}(t_j)$, and there are m rectangular bins of width
$\Delta t$ so that $m\Delta t = T$. Defining a data matrix

$$ \mathbf{X} = [\mathbf{u}_1 \;\; \mathbf{u}_2 \;\; \cdots \;\; \mathbf{u}_m], \tag{12.59} $$

we can then rewrite the two-point correlation tensor as

$$ \mathbf{R} \approx \frac{1}{m}\mathbf{X}\mathbf{X}^*, \tag{12.60} $$

which is exactly the definition of the covariance matrix in (1.39), i.e.,
$\mathbf{C} \approx \mathbf{R}$. Note that the role of 1/T is to average over
the various trajectories so that the average is subtracted out, giving rise to
a definition consistent with the covariance.

Figure 12.6: Illustration of an implementation of the quadrature rule to
evaluate the integrals $\int_0^T f(t)\,dt$. The rectangles of height
$f(t_j) = f_j$ and width $\delta t$ are summed to approximate the integral.
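
The link between the rectangle-rule sum and the snapshot matrix can be verified directly. In the sketch below the data are hypothetical random snapshots stored as columns, and the $n \times n$ correlation matrix is formed as $\mathbf{X}\mathbf{X}^*/m$; this ordering is an assumption of the sketch (the $m \times m$ product $\mathbf{X}^*\mathbf{X}$ has the same nonzero eigenvalues). The eigen-decomposition of the correlation matrix is then compared with the SVD of the snapshot matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 200
T = 10.0
dt = T / m
X = rng.standard_normal((n, m))               # columns u_1, ..., u_m

# Rectangle-rule sum (dt / T) * sum_j u_j u_j^*
R = np.zeros((n, n))
for j in range(m):
    R += (dt / T) * np.outer(X[:, j], X[:, j].conj())

# Same quantity written with the snapshot matrix
R_matrix = X @ X.conj().T / m

evals, evecs = np.linalg.eigh(R_matrix)
evals, evecs = evals[::-1], evecs[:, ::-1]    # sort in descending order
U, S, Vh = np.linalg.svd(X, full_matrices=False)
```

The eigenvectors of the correlation matrix coincide (up to sign) with the left singular vectors of X, and its eigenvalues are the squared singular values divided by m.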


Higher-Order Quadrature Rules 


Numerical integration simply calculates the area under a given curve. The basic
ideas for performing such an operation come from the definition of integration,

$$ \int_a^b f(t)\,dt = \lim_{\Delta t \to 0} \sum_{j=0}^{m-1} f(t_j)\,\Delta t, \tag{12.61} $$

where $b - a = (m - 1)\Delta t$. The area under the curve is a limiting process
of summing up an ever-increasing number of rectangles. This process is known
as numerical quadrature. Specifically, any sum can be represented as follows:

$$ Q[f] = \sum_{j=0}^{m-1} w_j f(t_j) = w_0 f(t_0) + w_1 f(t_1) + \cdots + w_{m-1} f(t_{m-1}), \tag{12.62} $$

where $a = t_0 < t_1 < t_2 < \cdots < t_{m-1} = b$. Thus the integral is evaluated as

$$ \int_a^b f(t)\,dt = Q[f] + E[f], \tag{12.63} $$


where the term E[f] is the error in approximating the integral by the quadrature
sum (12.62). Typically, the error E[f] is due to truncation error. To integrate,
we will use polynomial fits to the y-values $f(t_j)$. Thus we assume the
function f(t) can be approximated by a polynomial,

$$ P_n(t) = a_n t^n + a_{n-1} t^{n-1} + \cdots + a_1 t + a_0, \tag{12.64} $$

where the truncation error in this case is proportional to the (n+1)th
derivative, $E[f] = A f^{(n+1)}(c)$, and A is a constant. This process of
polynomial fitting the data gives the Newton-Cotes formulas.

The following integration approximations result from using a polynomial
fit through the data to be integrated. It is assumed that

$$ t_k = t_0 + \Delta t\,k \quad \text{and} \quad f_k = f(t_k). \tag{12.65} $$

This gives the following integration algorithms:

trapezoid rule

$$ \int_{t_0}^{t_1} f(t)\,dt = \frac{\Delta t}{2}(f_0 + f_1) - \frac{\Delta t^3}{12} f''(c), \tag{12.66a} $$

Simpson's rule

$$ \int_{t_0}^{t_2} f(t)\,dt = \frac{\Delta t}{3}(f_0 + 4f_1 + f_2) - \frac{\Delta t^5}{90} f''''(c), \tag{12.66b} $$

Simpson's 3/8 rule

$$ \int_{t_0}^{t_3} f(t)\,dt = \frac{3\Delta t}{8}(f_0 + 3f_1 + 3f_2 + f_3) - \frac{3\Delta t^5}{80} f''''(c), \tag{12.66c} $$

Boole's rule

$$ \int_{t_0}^{t_4} f(t)\,dt = \frac{2\Delta t}{45}(7f_0 + 32f_1 + 12f_2 + 32f_3 + 7f_4) - \frac{8\Delta t^7}{945} f^{(6)}(c). \tag{12.66d} $$

These algorithms have varying degrees of accuracy. Specifically, they are
$O(\Delta t^2)$, $O(\Delta t^4)$, $O(\Delta t^4)$, and $O(\Delta t^6)$ accurate
schemes, respectively. The accuracy condition is determined from the truncation
terms of the polynomial fit. Note that the trapezoidal rule uses a sum of simple
trapezoids to approximate the integral. Simpson's rule fits a quadratic curve
through three points and calculates the area under the quadratic curve.
Simpson's 3/8 rule uses four points and a cubic polynomial to evaluate the area,
while Boole's rule uses five points and a quartic polynomial fit to generate an
evaluation of the integral.
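
The four single-panel Newton-Cotes rules translate directly into code. The sketch below uses plain Python; the function names and the polynomial test integrands are illustrative choices. Each rule is applied once over [0, 1], with the step chosen so that the panel spans the whole interval.

```python
def trapezoid(f, t0, dt):
    """Two-point rule over [t0, t0 + dt]."""
    return dt / 2 * (f(t0) + f(t0 + dt))

def simpson(f, t0, dt):
    """Three-point rule over [t0, t0 + 2 dt]."""
    return dt / 3 * (f(t0) + 4 * f(t0 + dt) + f(t0 + 2 * dt))

def simpson38(f, t0, dt):
    """Four-point rule over [t0, t0 + 3 dt]."""
    return 3 * dt / 8 * (f(t0) + 3 * f(t0 + dt)
                         + 3 * f(t0 + 2 * dt) + f(t0 + 3 * dt))

def boole(f, t0, dt):
    """Five-point rule over [t0, t0 + 4 dt]."""
    return 2 * dt / 45 * (7 * f(t0) + 32 * f(t0 + dt) + 12 * f(t0 + 2 * dt)
                          + 32 * f(t0 + 3 * dt) + 7 * f(t0 + 4 * dt))

def cubic(t):
    return t**3

def quintic(t):
    return t**5

# Integrals over [0, 1]; exact values are 1/4 for the cubic and 1/6 for the quintic
I_trap = trapezoid(cubic, 0.0, 1.0)
I_simp = simpson(cubic, 0.0, 0.5)
I_38 = simpson38(cubic, 0.0, 1.0 / 3.0)
I_boole = boole(quintic, 0.0, 0.25)
```

Because the error terms involve the fourth and sixth derivatives, both Simpson rules integrate the cubic exactly and Boole's rule integrates the quintic exactly, whereas the trapezoid rule does not.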

The integration methods (12.66) give values for the integrals over only a
small part of the integration domain. The trapezoidal rule, for instance, only
gives a value for $t \in [t_0, t_1]$. However, our fundamental aim is to
evaluate the integral over the entire domain $t \in [a, b]$. Assuming once again
that our interval is divided as $a = t_0 < t_1 < t_2 < \cdots < t_{m-1} = b$,
then the trapezoidal rule applied over the interval gives the total integral

$$ \int_a^b f(t)\,dt \approx \sum_{j=0}^{m-2} \frac{\Delta t}{2}\left(f_j + f_{j+1}\right). \tag{12.67} $$

Writing out this sum gives

$$ \begin{aligned}
\sum_{j=0}^{m-2} \frac{\Delta t}{2}\left(f_j + f_{j+1}\right)
 &= \frac{\Delta t}{2}(f_0 + f_1) + \frac{\Delta t}{2}(f_1 + f_2) + \cdots + \frac{\Delta t}{2}(f_{m-2} + f_{m-1}) \\
 &= \frac{\Delta t}{2}\left(f_0 + 2f_1 + 2f_2 + \cdots + 2f_{m-2} + f_{m-1}\right) \\
 &= \frac{\Delta t}{2}\left(f_0 + f_{m-1} + 2\sum_{j=1}^{m-2} f_j\right).
\end{aligned} \tag{12.68} $$

The final expression no longer double-counts the values of the points between
$f_0$ and $f_{m-1}$. Instead, the final sum only counts the intermediate values
once, thus making the algorithm about twice as fast as the previous sum
expression. These are computational savings which should always be exploited if
possible.
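
The two equivalent forms of the composite trapezoidal rule can be compared in a short sketch. The function names and the test integrand (sin t on [0, pi], whose integral is 2) are illustrative assumptions; the loop also records the error for three grid sizes so that the second-order convergence can be observed.

```python
import math

def trapz_pairs(f_vals, dt):
    """Sum of single-panel trapezoids: (dt/2) * sum_j (f_j + f_{j+1})."""
    return sum(dt / 2 * (f_vals[j] + f_vals[j + 1])
               for j in range(len(f_vals) - 1))

def trapz_composite(f_vals, dt):
    """Equivalent form in which interior points are counted only once."""
    return dt * (0.5 * (f_vals[0] + f_vals[-1]) + sum(f_vals[1:-1]))

def sample(f, a, b, m):
    """Evaluate f at m equally spaced points on [a, b]; return values and spacing."""
    dt = (b - a) / (m - 1)
    return [f(a + j * dt) for j in range(m)], dt

errors = []
for m in (11, 21, 41):
    vals, dt = sample(math.sin, 0.0, math.pi, m)
    errors.append(abs(trapz_composite(vals, dt) - 2.0))

vals, dt = sample(math.sin, 0.0, math.pi, 11)
same = abs(trapz_pairs(vals, dt) - trapz_composite(vals, dt))
```

Halving the spacing reduces the error by roughly a factor of four, consistent with an $O(\Delta t^2)$ scheme.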


POD Modes from Quadrature Rules 


Any of these algorithms could be used to approximate the two-point
correlation tensor $R(\xi, x)$. The method of snapshots implicitly uses the
trapezoidal rule to produce the snapshot matrix X. Specifically, recall that

$$ \mathbf{X} = [\mathbf{u}_1 \;\; \mathbf{u}_2 \;\; \cdots \;\; \mathbf{u}_m], \tag{12.69} $$

where the columns $\mathbf{u}_k \in \mathbb{C}^n$ may be measurements from
simulations or experiments. The SVD of this matrix produces the modes used to
produce a low-rank embedding $\Psi$ of the data.

One could alternatively use a higher-order quadrature rule to produce a
low-rank decomposition. Thus the matrix (12.69) would be modified to

$$ \mathbf{X} = [\mathbf{u}_1 \;\; 4\mathbf{u}_2 \;\; 2\mathbf{u}_3 \;\; 4\mathbf{u}_4 \;\; 2\mathbf{u}_5 \;\; \cdots \;\; 4\mathbf{u}_{m-1} \;\; \mathbf{u}_m], \tag{12.70} $$

where the Simpson's rule quadrature formula is used. Simpson's rule is
commonly used in practice, as it is simple to execute and provides significant
improvement in accuracy over the trapezoidal rule. Producing this matrix simply
involves multiplying the data matrix on the right by the diagonal matrix with
entries $[1 \;\; 4 \;\; 2 \;\; 4 \;\; \cdots \;\; 2 \;\; 4 \;\; 1]$.
The SVD can then be used to construct a low-rank embedding $\Psi$. Before
approximating the low-rank solution, the quadrature weighting matrix must be
undone. To our knowledge, very little work has been done in quantifying the
merits of various quadrature rules. However, the interested reader should
consider the optimal snapshot sampling strategy developed by Kunisch and
Volkwein [419].

Figure 12.7: (a) Translating Gaussian with speed c = 3. The singular value
decomposition produces a slow decay of the singular values, which is shown on
a (b) normal and (c) logarithmic plot.
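
Returning to the quadrature-weighted snapshots discussed before the figure, the effect of Simpson weighting on the correlation matrix can be sketched as follows. Several choices here are assumptions of the sketch rather than statements from the text: the helper `simpson_weights`, the polynomial test trajectory $\mathbf{u}(t) = [1, t, t^2]^T$ on [0, 1] (whose exact correlation matrix has entries $1/(i + j + 1)$), and the use of square-root weights on the columns so that the weighted matrix reproduces the weighted correlation sum.

```python
import numpy as np

def simpson_weights(m, dt):
    """Composite Simpson weights (dt/3) * [1 4 2 4 ... 2 4 1]; m must be odd."""
    if m % 2 == 0:
        raise ValueError("composite Simpson's rule needs an odd number of points")
    w = np.ones(m)
    w[1:-1:2] = 4.0
    w[2:-1:2] = 2.0
    return w * dt / 3.0

m, T = 11, 1.0
t = np.linspace(0.0, T, m)
dt = t[1] - t[0]
X = np.vstack([np.ones(m), t, t**2])           # column j is u(t_j) = [1, t_j, t_j^2]^T

w_simp = simpson_weights(m, dt)
w_trap = np.full(m, dt)
w_trap[[0, -1]] = dt / 2

# Weighted correlation sums (1/T) * sum_j w_j u_j u_j^T
R_simp = (X * w_simp) @ X.T / T
R_trap = (X * w_trap) @ X.T / T
R_exact = np.array([[1 / (i + j + 1) for j in range(3)] for i in range(3)])

err_simp = np.abs(R_simp - R_exact).max()
err_trap = np.abs(R_trap - R_exact).max()

# Scaling columns by square-root weights gives X_w X_w^T = R_simp
X_w = X * np.sqrt(w_simp / T)
```

With only 11 snapshots the Simpson-weighted correlation is markedly closer to the exact one than the trapezoid-weighted version. Modes computed from `X_w` live in the same spatial coordinates as the data, while the temporal coefficients carry the weights and must be rescaled before reuse.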


12.5 POD with Symmetries: Rotations and Translations


The POD method is not without its shortcomings. It is well known in the POD
community that the underlying SVD algorithm does not handle invariances in the
data in an optimal way. The most common invariances arise from translational
or rotational invariances in the data. Translational invariance is observed in the 
simple phenomenon of wave propagation, making it difficult for correlation to 
be computed, since critical features in the data are no longer aligned snapshot 
to snapshot. 

In what follows, we will consider the effects of both translation and rota- 
tion. The examples are motivated from physical problems of practical interest. 
The important observation is that, unless the invariance structure is accounted 
for, the POD reduction will give an artificially inflated dimension for the un- 
derlying dynamics. This challenges our ability to use the POD as a diagnostic 
tool or as the platform for reduced-order models. 


Translation: Wave Propagation 


To illustrate the impact of translation on a POD analysis, consider a simple 
translating Gaussian propagating with velocity c: 


$$ u(x,t) = \exp\left[-(x - ct + 15)^2\right]. \tag{12.71} $$


We consider this solution on the space and time intervals $x \in [-20, 20]$ and
$t \in [0, 10]$.

Figure 12.7(a) demonstrates the simple evolution to be considered. As is
clear from the figure, the translation of the pulse affects the correlation at
a given spatial location. Naive application of the SVD does not account
for the translating nature of the data. As a result, the singular values produced
by the SVD decay slowly, as shown in Figs. 12.7(b) and (c). In fact, the first
few modes each contain approximately 8% of the variance.

Figure 12.8: First four (a) spatial modes (first four columns of the U matrix) and
(b) temporal modes (first four columns of the V matrix). A wave translating at
a constant speed produces Fourier mode structures in both space and time.

The slow decay of singular values suggests that a low-rank embedding is
not easily constructed. Moreover, there are interesting issues interpreting the
POD modes and their time dynamics. Figure 12.8 shows the first four spatial (U)
and temporal (V) modes generated by the SVD. The spatial modes are global in
that they span the entire region where the pulse propagation occurred.
Interestingly, they appear to be Fourier modes over the region where the pulse
propagated. The temporal modes illustrate a similar Fourier mode basis for this
specific example of a translating wave propagating at a constant velocity.
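
The slow singular value decay for a translating pulse is straightforward to reproduce. The sketch below builds the translating Gaussian with speed c = 3 on the stated space and time intervals; the grid sizes (401 points in space, 101 snapshots) and the 99% energy threshold are assumptions made for illustration.

```python
import numpy as np

c = 3.0
x = np.linspace(-20, 20, 401)
t = np.linspace(0, 10, 101)

# Each column is one snapshot of exp(-(x - c t + 15)^2)
X = np.exp(-(x[:, None] - c * t[None, :] + 15.0)**2)

s = np.linalg.svd(X, compute_uv=False)
energy = s**2 / np.sum(s**2)
# Number of modes needed to reach 99% of the total energy
r99 = int(np.searchsorted(np.cumsum(energy), 0.99) + 1)
```

Even though every snapshot is the same Gaussian, no single mode carries more than a small fraction of the energy, and a few dozen modes are needed to reach the 99% threshold.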

The failure of POD in this case is due simply to the translational invariance.
If the invariance is removed, or factored out [609], before a data reduction is
attempted, then the POD method can once again be used to produce a low-rank
approximation. In order to remove the invariance, the invariance must first be
identified and an auxiliary variable defined. Thus we consider the dynamics
rewritten as

$$ u(x,t) \to u(x - c(t)), \tag{12.72} $$

where c(t) corresponds to the translational invariance in the system responsible
for limiting the POD method. The parameter c can be found by a number of
methods. Rowley and Marsden [609] propose a template-based technique for
factoring out the invariance. Alternatively, a simple center-of-mass calculation
can be used to compute the location of the wave and the variable c(t) [420].

Figure 12.9: Spiral waves (a) $u(x, y)$, (b) $|u(x, y)|$, and (c) $u(x, y)^5$ on
the domain $x \in [-20, 20]$ and $y \in [-20, 20]$. The spirals are made to spin
clockwise with angular velocity $\omega$.
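
Factoring out the translation with a center-of-mass estimate, as described for the translating Gaussian, can be sketched as follows. The grid sizes, the use of linear interpolation (`np.interp`) to shift each snapshot, and the helper `first_mode_energy` are assumptions of this sketch; a template-based method would replace the center-of-mass step.

```python
import numpy as np

c = 3.0
x = np.linspace(-20, 20, 401)
t = np.linspace(0, 10, 101)
X = np.exp(-(x[:, None] - c * t[None, :] + 15.0)**2)

# Estimate the location c(t) of each snapshot from its center of mass
shift = (x[:, None] * X).sum(axis=0) / X.sum(axis=0)

# Re-align: evaluate each snapshot at x + c(t) so the pulse sits at the origin
X_aligned = np.column_stack([
    np.interp(x + shift[j], x, X[:, j], left=0.0, right=0.0)
    for j in range(t.size)
])

def first_mode_energy(A):
    """Fraction of the total energy captured by the leading singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[0]**2 / np.sum(s**2)

e_raw = first_mode_energy(X)
e_aligned = first_mode_energy(X_aligned)
speed = np.polyfit(t, shift, 1)[0]         # slope of c(t) estimates the wave speed
```

After alignment essentially all of the energy sits in a single mode, and the slope of the estimated shift recovers the propagation speed.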


Rotation: Spiral Waves 


A second invariance commonly observed in simulations and data is associated 
with rotation. Much like translation, rotation moves a coherent, low-rank struc- 
ture in such a way that correlations, which are produced at specific spatial loca- 
tions, are no longer produced. To illustrate the effects of rotational invariance, 
a localized spiral wave with rotation will be considered. 

A spiral wave centered at the origin can be defined as follows:

$$ u(x, y) = \tanh\left[ \sqrt{x^2 + y^2} \cos\left( A\angle(x + iy) - \sqrt{x^2 + y^2} \right) \right], \tag{12.73} $$

where A is the number of arms of the spiral, and the $\angle$ denotes the phase
angle of the quantity $(x + iy)$. To localize the spiral on a spatial domain, it
is multiplied by a Gaussian centered at the origin so that our function of
interest is given by

$$ f(x, y) = u(x, y) \exp\left[-0.01(x^2 + y^2)\right]. \tag{12.74} $$


This function creates the rotation structure we wish to consider. The rate 
of spin can be made faster or slower by lowering or raising the value of the 
denominator, respectively. 

In addition to considering the function $u(x, y)$, we will also consider the
closely related functions $|u(x, y)|$ and $u(x, y)^5$ as shown in Fig. 12.9.
Although these three functions clearly have the same underlying function that
rotates, the change in functional form is shown to produce quite different
low-rank approximations for the rotating waves.

To begin our analysis, consider the function u(x, y) illustrated in Fig. 12.9(a).
The SVD of this matrix can be computed and its low-rank structure evaluated.
Two figures are produced (Figs. 12.10 and 12.11). The first assesses the rank
of the observed dynamics and the temporal behavior of the first four modes
in V. Figures 12.10(b) and (c) show the decay of singular values on a regular
and logarithmic scale, respectively. Remarkably, the first two modes capture
all the variance of the data to numerical precision. This is further illustrated
in the time dynamics of the first four modes. Specifically, the first two modes
of Fig. 12.10(a) have a clear oscillatory signature associated with the rotation
of modes one and two of Fig. 12.11. Modes 3 and 4 resemble noise in both time
and space as a result of numerical round-off.

Figure 12.10: (a) First four temporal modes of the matrix V. To numerical
precision, all the variance is in the first two modes, as shown by the singular
value decay on a (b) normal and (c) logarithmic plot. Remarkably, the POD
extracts exactly two modes (see Fig. 12.11) to represent the rotating spiral
wave.

The spiral wave (12.74) allows for a two-mode truncation that is accurate
to numerical precision. This is in part due to the sinusoidal nature of the
solution when circumnavigating the solution at a fixed radius. Simply changing
the data from $u(x, t)$ to either $|u(x, t)|$ or $u(x, t)^5$ reveals that the
low-rank modes and their time dynamics are significantly different (see
Figs. 12.12 and 12.13). Figures 12.12(a) and (b) show the decay of the singular
values for these two new functions and demonstrate the significant difference
from the two-mode evolution previously considered. The dominant time dynamics
computed from the matrix V are also demonstrated. In the case of the absolute
value of the function $|u(x, t)|$, the decay of the singular values is slow and
never approaches numerical precision. The quintic function suggests a rank
r = 6 truncation is capable of producing an approximation to numerical
precision. This highlights the fact that rotational invariance complicates the
POD reduction procedure. After all, the only difference between the three
rotating solutions is the actual shape of the rotating function, as they are
all rotating with the same speed.

Figure 12.11: First four POD modes associated with the rotating spiral wave
u(x, y). The first two modes capture all the variance to numerical precision,
while the third and fourth mode are noisy due to numerical round-off. The
domain considered is $x \in [-20, 20]$ and $y \in [-20, 20]$.

Figure 12.12: Decay of the singular values on a (a) normal and (b) logarithmic
scale showing that the function $|u(x, t)|$ produces a slow decay while
$u(x, t)^5$ produces an r = 6 approximation to numerical accuracy. The first
four temporal modes of the matrix V are shown for these two functions in panels
(c) and (d), respectively. The spatial modes are shown in Fig. 12.13.

Figure 12.13: First four POD modes associated with the rotating spiral wave
$|u(x, y)|$ (top row) and $u(x, t)^5$ (bottom row). Unlike our previous example,
the first four modes do not capture all the variance to numerical precision,
thus requiring more modes for accurate approximation. The domain considered is
$x \in [-20, 20]$ and $y \in [-20, 20]$.

Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved.
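
The contrast between the three rotating functions can be reproduced with a short experiment. The sketch below rests on assumptions that go beyond the text: the amplitude $\tanh(r)$ is taken to multiply the cosine (so that the field is sinusoidal in angle at a fixed radius), rotation is imposed by advancing the phase of the cosine by a hypothetical increment `dtau` per snapshot, and a 64 x 64 grid with one spiral arm is used.

```python
import numpy as np

n, A = 64, 1
grid = np.linspace(-20, 20, n)
Xg, Yg = np.meshgrid(grid, grid)
r = np.sqrt(Xg**2 + Yg**2)
theta = np.angle(Xg + 1j * Yg)
envelope = np.exp(-0.01 * r**2)                # localizing Gaussian

def snapshots(transform, steps=100, dtau=0.1):
    """Snapshot matrix of the rotating spiral after applying `transform` to u."""
    cols = []
    for j in range(steps):
        # Assumed form: tanh(r) times a cosine whose phase advances in time
        u = np.tanh(r) * np.cos(A * theta - r + j * dtau)
        cols.append((transform(u) * envelope).ravel())
    return np.column_stack(cols)

def normalized_singular_values(M):
    s = np.linalg.svd(M, compute_uv=False)
    return s / s[0]

s_u = normalized_singular_values(snapshots(lambda u: u))
s_abs = normalized_singular_values(snapshots(np.abs))
s_u5 = normalized_singular_values(snapshots(lambda u: u**5))
```

Under these assumptions the plain spiral has exactly two nonzero singular values, the fifth power has six (one pair for each of the first, third, and fifth angular harmonics of $\cos^5$), and the absolute value shows a slow decay because $|\cos|$ contains infinitely many harmonics.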

To conclude, invariance can severely limit the POD method. Most notably, 
it can artificially inflate the dimension of the system and lead to compromised 
interpretability. Expert knowledge of a given system and its potential
invariances can help frame mathematical strategies to remove the invariances,
i.e., re-aligning the data [420, 609]. But this strategy also has limitations, especially
if two or more invariant structures are present. For instance, if two waves of 
different speeds are observed in the data, then the methods proposed for re- 
moving invariances will fail to capture both wave speeds simultaneously. Ulti- 
mately, dealing with invariances remains an open research question. 


12.6 Neural Networks for Time-Stepping with POD 


The emergence of machine learning is expanding the mathematical possibilities
for the construction of accurate ROMs. As shown in the previous sections, the
focus of traditional projection-based ROMs is on computing the low-dimensional
subspace $\Psi$ on which to project the governing equations. Recall that, in
constructing the low-dimensional subspace, the SVD is used on snapshots of
high-fidelity simulation (or experimental) data
$\mathbf{X} \approx \Psi\Sigma\mathbf{V}^*$. The POD reduction technique uses
only the single matrix $\Psi$ in the reduction process. The temporal evolution
in the reduced space $\Psi$ is quantified by $\Sigma\mathbf{V}^*$. This gives
explicitly the evolution of each mode over the snapshots of X, information that
is not used in projection-based ROMs. Neural networks can then be used directly
on the time-series data encoded in V to build a time-stepping algorithm for
marching the solution forward in time.

The motivation for using deep learning algorithms for time-stepping is the
recognition that projection-based model reduction often can produce unstable
iteration schemes [160]. A second important fact is that valuable temporal
information in the low-dimensional space is summarily dismissed by the
projection schemes, i.e., only the POD modes are retained for ROM construction.
Neural networks aim to leverage the temporal information and in the process
build efficient and stable time-stepping proxies. Recall that model reduction
proceeds by projecting into the low-dimensional subspace spanned by $\Psi$ so
that

$$ \mathbf{u}(t) \approx \Psi\mathbf{a}(t). \tag{12.75} $$


In the projection-based ROMs of previous sections, the amplitude dynamics
a(t) are constructed by Galerkin projection of the governing equations onto
$\Psi$. With neural networks, the dynamics a(t) are approximated from the
discrete time-series data encoded in V. Specifically, this gives

$$ \mathbf{a}(t) \to \Sigma\mathbf{V}^* = [\mathbf{a}_1 \;\; \mathbf{a}_2 \;\; \cdots \;\; \mathbf{a}_m] \tag{12.76} $$

over the m time snapshots of the original data matrix on which the ROM is to
be constructed.

Deep learning algorithms provide a flexible framework for constructing a
mapping between successive time-steps. As shown in Fig. 12.14, the typical
ROM architecture constrains the dynamics to a subspace spanned by the POD
modes $\Psi$. Thus in the original coordinate system, the high-fidelity
simulations of the governing equations for u are solved with a given numerical
discretization scheme to produce a snapshot matrix X containing
$\mathbf{u}_k$. In the new coordinate system, which is generated by projection
to the subspace $\Psi$, the snapshot matrix is now constructed from
$\mathbf{a}_k$ as shown in (12.76). In traditional ROMs, the snapshot matrix
(12.76) is not used. Instead snapshots of $\mathbf{a}_k$ are achieved
by solving the Galerkin projected model (12.21). However, the snapshot matrix
(12.76) can be used to construct a time-stepping model using neural networks.
Neural networks allow one to use the high-fidelity simulation data to train a
mapping

$$ \mathbf{a}_{k+1} = \mathbf{f}_\theta(\mathbf{a}_k), \tag{12.77} $$

where $\mathbf{f}_\theta$ is a generic representation of a neural network which
is characterized by its structure, weights, and biases. Note that deep learning
can also be used to learn nonlinear coordinates that generalize the SVD
embedding.

Recently, Parish and Carlberg [548] and Regazzoni et al. [595] developed
a suite of neural-network-based methods for learning time-stepping models
for (12.77). Moreover, they provide extensive comparisons between different
neural network architectures along with traditional techniques for time-series
modeling. In such models the neural networks (or time-series analysis methods)
simply map an input ($\mathbf{a}_k$) to an output ($\mathbf{a}_{k+1}$) as in Section 6.6.
Autoencoders can also replace the POD embedding above.


Figure 12.14: Illustration of neural network integration with POD subspaces.
The autoencoder structure projects the original high-dimensional state-space
data into a low-dimensional space via $\mathbf{u}(t) \approx \boldsymbol{\Psi}\mathbf{a}(t)$. As shown in the bottom
left, the snapshots $\mathbf{u}_k$ are generated by high-fidelity numerical solutions of the
governing equations $\mathbf{u}_t = \mathbf{N}(\mathbf{u}, \mathbf{u}_x, \mathbf{u}_{xx}, \ldots, x, t; \boldsymbol{\beta})$. In traditional ROMs, the
snapshots $\mathbf{a}_k$ are constructed from the Galerkin projection
$d\mathbf{a}/dt = \boldsymbol{\Psi}^T\mathbf{L}\boldsymbol{\Psi}\mathbf{a} + \boldsymbol{\Psi}^T\mathbf{N}(\boldsymbol{\Psi}\mathbf{a}, \boldsymbol{\beta})$, as shown in the bottom
right. Neural networks instead learn a mapping $\mathbf{a}_{k+1} = \mathbf{f}_{\boldsymbol{\theta}}(\mathbf{a}_k)$ from the original,
low-dimensional snapshot data. It should be noted that time-stepping Runge-Kutta
schemes, for instance, are a form of feedforward neural networks, which
are used to produce the original high-fidelity data snapshots $\mathbf{u}_k$ [289].

In its simplest form, the neural network training requires input-output pairs
that can be generated from snapshots $\mathbf{a}_k$. Thus two matrices can be constructed:


$$\mathbf{A} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_{m-1} \\ | & | & & | \end{bmatrix} \quad \text{and} \quad \mathbf{A}' = \begin{bmatrix} | & | & & | \\ \mathbf{a}_2 & \mathbf{a}_3 & \cdots & \mathbf{a}_m \\ | & | & & | \end{bmatrix}, \tag{12.78}$$


where $\mathbf{A}$ denotes the input and $\mathbf{A}'$ denotes the output. This gives the training
data necessary for learning (optimizing) a neural network map:


$$\mathbf{A}' = \mathbf{f}_{\boldsymbol{\theta}}(\mathbf{A}). \tag{12.79}$$


There are numerous neural network architectures that can learn the mapping
$\mathbf{f}_{\boldsymbol{\theta}}$. In Section 6.6, a simple feedforward network was already shown to be quite
accurate in learning such a model. Further sophistication can improve accuracy
and reduce data requirements for training.
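
To make the pairing in (12.78) concrete, the sketch below splits a coefficient matrix into inputs and outputs and fits a one-step map. As a stated simplification, a linear least-squares map stands in for the neural network $\mathbf{f}_{\boldsymbol{\theta}}$; any regression model could be substituted at that step. The helper names and the toy rotation data are hypothetical.

```python
import numpy as np

def make_pairs(A_full):
    """Split [a_1 ... a_m] into inputs [a_1 ... a_{m-1}] and outputs [a_2 ... a_m]."""
    return A_full[:, :-1], A_full[:, 1:]

def fit_linear_stepper(A_in, A_out):
    """Least-squares stand-in for f_theta: find W such that A_out ~ W @ A_in."""
    return A_out @ np.linalg.pinv(A_in)

def rollout(step, a0, n_steps):
    """Apply the one-step map repeatedly, starting from a0."""
    traj = [a0]
    for _ in range(n_steps):
        traj.append(step(traj[-1]))
    return np.stack(traj, axis=1)

# Toy POD coefficients (assumed for illustration): a pure rotation in the plane
dt = 0.1
t = dt * np.arange(200)
A_full = np.vstack([np.cos(t), np.sin(t)])

A_in, A_out = make_pairs(A_full)
W = fit_linear_stepper(A_in, A_out)

# Forecast from the first snapshot only, then compare with the data
pred = rollout(lambda a: W @ a, A_full[:, 0], n_steps=A_full.shape[1] - 1)
max_err = np.max(np.abs(pred - A_full))
```

The rollout uses only the learned map and the initial condition, which is the setting in which stability of the time-stepper matters most.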

Regazzoni et al. [595] formulated the optimization of this map in terms of
maximum likelihood. Specifically, they considered the most suitable representation
of the high-fidelity model in terms of simpler neural network models.
They show that such neural network models can approximate the solution to
within any accuracy required (limited by the accuracy of the training data,
of course) simply by constructing them from the input-output pairs given by
(12.79). Parish and Carlberg [548] provide an in-depth study of different neural
network architectures that can be used for learning the time-steppers. They
are especially focused on recurrent neural network (RNN) architectures that have
proven to be so effective in temporal sequences associated with language [290].
Their extensive comparisons show that long short-term memory (LSTM)
neural networks outperform other methods and provide substantial improvements
over traditional time-series approaches such as autoregressive models.
In addition to a baseline Gaussian process (GP) regression, they specifically compare
time-stepping models that include the following: k-nearest neighbors (KNN),
artificial neural networks (ANN), autoregressive with exogenous inputs (ARX),
integrated ANN (ANN-D), latent ARX (LARX), RNN, LSTM, and standard GP.
Some models include recursive training (RT) and others do not (NRT). Their
comparisons on a diversity of PDE models, which will not be detailed here,
are evaluated on the fraction of variance unexplained (FVU). Figure 12.15 gives
a representation of the extensive comparisons made on these methods for an
advection-diffusion PDE model.

The success of neural networks for learning time-stepping representations
fits more broadly under the aegis of flow maps [753], which were introduced
earlier:


$$\mathbf{u}_{k+1} = \mathbf{F}(\mathbf{u}_k). \tag{12.80}$$


For neural networks, the flow map is approximated by the learned model (12.77)
so that $\mathbf{F} = \mathbf{f}_{\boldsymbol{\theta}}$. Qin et al. [575] and Liu et al. [449] have explored the construction
of flow maps from neural networks as yet another modeling paradigm for
advancing the solution in time without recourse to high-fidelity simulations.
Such methods offer a broader framework for fast time-stepping algorithms, as
no initial dimensionality reduction needs to be computed. In Qin et al. [575], the
neural network model $\mathbf{f}_{\boldsymbol{\theta}}$ is constructed with a residual network (ResNet) as the
basic architecture for approximation. In addition to a one-step method, which
is shown to be exact in temporal integration, a recurrent ResNet and recursive
ResNet are also constructed for multiple time-steps. Their formulation is also in
the weak form where no derivative information is required in order to produce
the time-stepping approximations. Several numerical examples are presented
to demonstrate the performance of the methods. Like Parish and Carlberg [548]
and Regazzoni et al. [595], the method is shown to be exceptionally accurate
even in comparison with direct numerical integration, highlighting the qualities
of the universal approximation properties of $\mathbf{f}_{\boldsymbol{\theta}}$.


Figure 12.15: Comparison of a diversity of error metrics and methods for constructing
the mapping for advection-diffusion equations. Panels: (a) normed
state error; (b) QoI error. In all models considered in the paper, the LSTM and
RNN structures proved to be the most accurate models for time-stepping. The
reader is encouraged to consult the original paper for the details of the underlying
models, the error metrics displayed, and the training data used. Python codes
are available in the appendix of the original paper. From Parish and Carlberg [548].

Liu et al. [449] leveraged the flow map approximation scheme to learn a
multi-scale time-stepping scheme. Specifically, one can learn flow maps for different
characteristic timescales. Thus a given model


$$\mathbf{a}_{k+\tau} = \mathbf{f}_{\boldsymbol{\theta}_\tau}(\mathbf{a}_k) \tag{12.81}$$


can learn a flow map over a prescribed timescale $\tau$. If there exist distinct timescales
in the data, for instance denoted by $t_1$, $t_2$, and $t_3$ with $t_1 > t_2 > t_3$ (slow,
medium, and fast times), then three models can be learned: $\mathbf{f}_{\boldsymbol{\theta}_1}$, $\mathbf{f}_{\boldsymbol{\theta}_2}$, and $\mathbf{f}_{\boldsymbol{\theta}_3}$ for
the slow, medium, and fast times, respectively. Figure 12.16 shows the hierarchical
time-stepping (HiTS) scheme with three distinct timescales. The training data
of a high-fidelity simulation, or collection of experimental data, allow for the
construction of flow maps, which can then be used to efficiently forecast long
times into the future. Specifically, one can use the flow map constructed on the
slowest scale $\mathbf{f}_{\boldsymbol{\theta}_1}$ to march far into the future, while the medium and fast scales
are then used to advance to the specific point in time. Thus a minimal number
of steps is taken on the fast scale, and the work of forecasting long into the future
is done by the slow and medium scales. The method is highly efficient and
accurate.


Figure 12.16: Multi-scale hierarchical time-stepping scheme. Neural network
representations of the time-steppers are constructed over three distinct
timescales. The red model takes large steps (slow timescale $\mathbf{f}_{\boldsymbol{\theta}_1}$), leaving
the finer time-stepping to the yellow (medium timescale $\mathbf{f}_{\boldsymbol{\theta}_2}$) and blue (fast
timescale $\mathbf{f}_{\boldsymbol{\theta}_3}$) models. The dark path shows the sequence of maps from $\mathbf{u}_i$ to
$\mathbf{u}_m$. Modified from Liu et al. [449].
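
The idea of letting the slowest model do most of the marching can be sketched as a greedy composition of flow maps. This is a simplified illustration under stated assumptions and not the implementation in the HiTS repository: exact rotation maps stand in for the trained networks $\mathbf{f}_{\boldsymbol{\theta}_\tau}$, and the step sizes (1, 8, and 64 base steps) are chosen arbitrarily.

```python
import numpy as np

def hierarchical_forecast(a0, n_steps, steppers):
    """Advance a0 by n_steps base time-steps, taking the largest steps first.

    steppers maps a step size (in base time-steps) to a flow map a -> a_next.
    Returns the final state and the list of step sizes that were applied.
    """
    a, remaining, used = a0, n_steps, []
    for size in sorted(steppers, reverse=True):
        while remaining >= size:
            a = steppers[size](a)
            remaining -= size
            used.append(size)
    if remaining:
        raise ValueError("the available step sizes cannot reach the horizon")
    return a, used

def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

# Stand-ins for learned flow maps over fast (1), medium (8), and slow (64) steps
dt = 0.01
steppers = {size: (lambda a, R=rotation(size * dt): R @ a) for size in (1, 8, 64)}

a0 = np.array([1.0, 0.0])
a_final, used = hierarchical_forecast(a0, n_steps=150, steppers=steppers)
a_exact = rotation(150 * dt) @ a0
```

In this example the 150-step horizon is reached with ten model evaluations (two slow, two medium, six fast) rather than 150 evaluations of the fast model.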

Figure 12.17 compares the HiTS scheme across a number of example problems,
some of which are videos and music frames. Thus HiTS does not require
governing equations, simply time-series data arranged into input-output pairs. 
The performance of such flow maps is remarkably robust, stable, and accurate, 
even when compared to leading time-series neural networks such as LSTMs, 
echo state networks (ESNs), and clockwork recurrent neural networks (CW-RNNs). 
This is especially true for long forecasts, in contrast to the small time-steps eval- 
uated in the work of Parish and Carlberg [548]. 

Overall, the works of Parish and Carlberg [548], Regazzoni et al. [595], Qin
et al. [575], and Liu et al. [449] exploit very simple training paradigms related
to input-output pairings of temporal snapshot data as structured in (12.78).
This provides a significant potential improvement for learning time-stepping
proxies to the Galerkin projected models such as (12.21).


Figure 12.17: Evaluation of different neural network architectures (columns) on
each training sequence (rows). Key diagnostics are visualized from a diversity
of examples, including music files and videos. The last frame of the reconstruction
is visualized for the first, third, and fourth examples, while the entire music
score is visualized in the second example. Note the superior performance of
the hierarchical time-stepping scheme in comparison with other modern neural
network models such as LSTMs, echo state networks (ESNs), and clockwork
recurrent neural networks (CW-RNNs). From Liu et al. [449]. The code is publicly
available at https://github.com/luckystarufo/multiscale_HiTS


12.7 Leveraging DMD and SINDy for POD-Galerkin


The construction of a traditional ROM that is accurate and efficient is centered
on the reduction (12.21). Thus, once a low-rank subspace is computed from
the SVD, the POD modes $\boldsymbol{\Psi}$ are used for projecting the dynamics. In the last
section, projection of the governing evolution equations was circumvented by
simply learning a neural network for the temporal (time-stepping) evolution. In
this section, we use data-driven, non-intrusive methods in order to regress to a
model for the temporal dynamics. Consider the evolution dynamics in (12.13):


$$\mathbf{u}_t = \mathbf{L}\mathbf{u} + \mathbf{N}(\mathbf{u}, \mathbf{u}_x, \mathbf{u}_{xx}, \ldots, x, t; \boldsymbol{\beta}) \tag{12.82}$$


where the linear and nonlinear parts of the evolution, denoted by $\mathbf{L}$ and $\mathbf{N}(\cdot)$,
respectively, have been explicitly separated. The solution ansatz $\mathbf{u} = \boldsymbol{\Psi}\mathbf{a}$ yields
the ROM


$$\frac{d\mathbf{a}}{dt} = \boldsymbol{\Psi}^T\mathbf{L}\boldsymbol{\Psi}\mathbf{a} + \boldsymbol{\Psi}^T\mathbf{N}(\boldsymbol{\Psi}\mathbf{a}, \boldsymbol{\beta}). \tag{12.83}$$


Note that the linear operator in the reduced space $\boldsymbol{\Psi}^T\mathbf{L}\boldsymbol{\Psi}$ is an $r \times r$ matrix,
which is easily computed. The nonlinear portion of the operator $\boldsymbol{\Psi}^T\mathbf{N}(\boldsymbol{\Psi}\mathbf{a}, \boldsymbol{\beta})$
is more complicated since it involves repeated computation of the operator as
the solution $\mathbf{a}$, and consequently the high-dimensional state $\mathbf{u}$, is updated in
time. Efficient interpolation methods for computing the nonlinear contribution
in the ROM model are explored extensively in the next chapter of this book.


Simplifying POD-Galerkin with DMD 


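One method for overcoming the difficulties introduced in evaluating the nonlinear
term on the right-hand side is to introduce the DMD algorithm. DMD
approximates a set of snapshots by a best-fit linear model. Thus the nonlinearity
can be evaluated over snapshots and a linear model constructed to approximate
the dynamics. Thus two matrices can be constructed:


$$\mathbf{N} = \begin{bmatrix} | & | & & | \\ \mathbf{N}_1 & \mathbf{N}_2 & \cdots & \mathbf{N}_{m-1} \\ | & | & & | \end{bmatrix} \quad \text{and} \quad \mathbf{N}' = \begin{bmatrix} | & | & & | \\ \mathbf{N}_2 & \mathbf{N}_3 & \cdots & \mathbf{N}_m \\ | & | & & | \end{bmatrix}, \tag{12.84}$$


where $\mathbf{N}_k$ is the evaluation of the nonlinear term $\mathbf{N}(\mathbf{u}, \mathbf{u}_x, \mathbf{u}_{xx}, \ldots, x, t; \boldsymbol{\beta})$ at $t = t_k$.
Here $\mathbf{N}$ denotes the input and $\mathbf{N}'$ denotes the output. This gives the training
data necessary for regressing to a DMD model,


$$\mathbf{N}' = \mathbf{A}_N\mathbf{N}. \tag{12.85}$$

The governing equation (12.82) can then be approximated by

$$\mathbf{u}_t \approx \mathbf{L}\mathbf{u} + \mathbf{A}_N\mathbf{u} = (\mathbf{A} + \mathbf{A}_N)\mathbf{u}, \tag{12.86}$$


where the operator $\mathbf{L}$ has been replaced by $\mathbf{A}$. The dynamics is now completely
linear and solutions can be easily constructed from the eigenvalues and eigenvectors
of the linear operator $\mathbf{A} + \mathbf{A}_N$.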
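
A minimal sketch of the regression (12.85) is given below, using the SVD-based pseudoinverse to compute the best-fit operator from the two shifted snapshot matrices. The function name `dmd_operator`, the optional rank argument, and the small synthetic linear system are assumptions made for illustration; in an application the columns would be evaluations of the nonlinear term.

```python
import numpy as np

def dmd_operator(N_in, N_out, r=None):
    """Best-fit linear operator A_N with N_out ~ A_N @ N_in (optionally rank-r)."""
    U, s, Vh = np.linalg.svd(N_in, full_matrices=False)
    if r is not None:
        U, s, Vh = U[:, :r], s[:r], Vh[:r, :]
    # A_N = N_out @ pinv(N_in), written with the (truncated) SVD factors
    return N_out @ Vh.conj().T @ np.diag(1.0 / s) @ U.conj().T

# Synthetic snapshots generated by a known linear map (illustration only)
A_true = np.array([[0.9, -0.2, 0.0],
                   [0.2,  0.9, 0.0],
                   [0.0,  0.0, 0.5]])
snapshots = [np.array([1.0, 0.0, 1.0])]
for _ in range(10):
    snapshots.append(A_true @ snapshots[-1])
N_all = np.stack(snapshots, axis=1)          # 3 x 11 snapshot matrix

N_in, N_out = N_all[:, :-1], N_all[:, 1:]    # the matrices N and N' of (12.84)
A_N = dmd_operator(N_in, N_out)
eigvals = np.linalg.eigvals(A_N)             # DMD eigenvalues of the fitted map
```

Because the data here are exactly linear and span the full space, the regression recovers the generating matrix; for genuinely nonlinear snapshots the result is only a best-fit approximation.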

In practice, the DMD algorithm highlighted in Section 7.2 also exploits low-dimensional
structure in building a ROM model. Thus instead of the approximate
linear model (12.86), we instead wish to build a low-dimensional version.
From snapshots of the nonlinearity, the DMD algorithm can be used to
approximate the dominant rank-$r$ nonlinear contribution to the dynamics as


$$\mathbf{N}(\mathbf{u}, \mathbf{u}_x, \mathbf{u}_{xx}, \ldots, x, t; \boldsymbol{\beta}) \approx \sum_{j=1}^{r} b_j\boldsymbol{\phi}_j\exp(\omega_j t) = \boldsymbol{\Phi}\exp(\boldsymbol{\Omega}t)\mathbf{b}, \tag{12.87}$$


where $b_j$ determines the weighting of each mode. Here $\boldsymbol{\phi}_j$ is the DMD mode
and $\omega_j$ is the DMD eigenvalue. This approximation can be used in (12.83) to
produce the POD-DMD approximation:


$$\frac{d\mathbf{a}}{dt} \approx \boldsymbol{\Psi}^T\mathbf{L}\boldsymbol{\Psi}\mathbf{a} + \boldsymbol{\Psi}^T\boldsymbol{\Phi}\exp(\boldsymbol{\Omega}t)\mathbf{b}. \tag{12.88}$$


Figure 12.18: Computation time and accuracy on a semi-linear parabolic equation.
Four methods are compared: the high-fidelity simulation of the governing
equations (FULL); a Galerkin-POD reduction (POD);
a Galerkin-POD reduction with the discrete empirical interpolation (DEIM)
algorithm for evaluation of the nonlinearity (POD-DEIM); and the POD-DMD
approximation (12.88). The left panel shows the computation times, which for
POD-DMD are an order of magnitude faster than for traditional POD-DEIM
algorithms. The right panel shows the accuracy of the different methods for
reproducing the high-fidelity simulations. POD-DMD loses some accuracy in
comparison to Galerkin-POD methods due to the fact that DMD modes are not
orthogonal, and thus the error does not decrease as quickly as in the POD-based
methods. Modified from Alla and Kutz [10].


In this formulation, there are a number of advantageous features: (i) The nonlinearity
is only evaluated once with the DMD algorithm (12.87). (ii) The products
$\boldsymbol{\Psi}^T\mathbf{L}\boldsymbol{\Psi}$ and $\boldsymbol{\Psi}^T\boldsymbol{\Phi}$ are also only evaluated once and both produce matrices
that are low-rank, i.e., they are independent of the original high-rank system.
Thus with a one-time, up-front evaluation of two snapshot matrices to produce
$\boldsymbol{\Psi}$ and $\boldsymbol{\Phi}$, the DMD produces a computationally efficient ROM that requires no
recourse to the original high-dimensional system.

Alla and Kutz [10] integrated the DMD algorithm into the traditional ROM
formalism to produce the POD-DMD model (12.88). The comparison of this
computationally efficient ROM with traditional model reduction is shown in
Fig. 12.18. Specifically, both the computational time and error are evaluated using
this technique. Once the DMD algorithm is used to produce an approximation
of the nonlinear term, it can be used for producing future state predictions
and a computationally efficient ROM. Indeed, its computational acceleration is
quite remarkable in comparison to traditional methods. Moreover, the method
is non-intrusive and does not require additional evaluation of the nonlinear
term. The entire method can be used with randomized algorithms to speed up
the low-rank evaluations even further [11]. Note that the computational performance
boost comes at the expense of accuracy, as shown in Fig. 12.18. This is
primarily due to the fact that additional POD modes used for standard ROMs,
which are orthogonal by construction and guaranteed to be a best fit in $\ell_2$, are
now replaced by DMD modes, which are no longer orthogonal [10].


SINDy for POD-Galerkin Regression 


The SINDy regression framework also allows one to build a parsimonious model
for the evolution of the temporal dynamics in the low-rank subspace. Section 7.3
highlights the SINDy algorithm for model discovery. In the context of
ROMs, the goal is now to discover a model of the evolution dynamics of a high-fidelity
model embedded in a low-rank subspace. Recall that $\mathbf{u}(t) \approx \boldsymbol{\Psi}\mathbf{a}(t)$, where $\boldsymbol{\Psi}$
can be computed with the SVD. The evolution of $\mathbf{a}(t)$ ultimately determines
the temporal behavior of the system. Thus far, the temporal evolution has been
computed via Galerkin projection and DMD. SINDy gives yet another alternative
to model


$$\frac{d\mathbf{a}}{dt} = \mathbf{f}(\mathbf{a}), \tag{12.89}$$

where the right-hand side function prescribing the evolution dynamics $\mathbf{f}(\cdot)$ is
unknown. SINDy provides a sparse regression framework to determine this
dynamics. The snapshots of $\mathbf{a}(t)$ are collected into the matrix


$$\mathbf{A} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_m \\ | & | & & | \end{bmatrix}, \tag{12.90}$$


and the SINDy regression framework is then formulated as

$$\dot{\mathbf{A}} = \boldsymbol{\Theta}(\mathbf{A})\boldsymbol{\Xi}, \tag{12.91}$$


where each column $\boldsymbol{\xi}_k$ in $\boldsymbol{\Xi}$ is a vector of coefficients determining the active
terms in the $k$th row in (12.89). As in Section 7.3, leveraging parsimony provides
a dynamical model using as few terms as possible in $\boldsymbol{\Xi}$. Such a model may be
identified using a convex $\ell_1$-regularized sparse regression:


$$\boldsymbol{\xi}_k = \operatorname{argmin}_{\boldsymbol{\xi}_k'} \|\dot{\mathbf{A}}_k - \boldsymbol{\Theta}(\mathbf{A})\boldsymbol{\xi}_k'\|_2 + \lambda\|\boldsymbol{\xi}_k'\|_1. \tag{12.92}$$


Note that $\dot{\mathbf{A}}_k$ is the $k$th column of $\dot{\mathbf{A}}$, and $\lambda$ is a sparsity-promoting hyperparameter.
Section 7.3 discusses the many variants for sparsity promotion that can be
used [717], including the advocated sequential
least-squares thresholding to select active terms.
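
The sequential thresholded least-squares idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: snapshots are stored as rows (the transpose of (12.90)) so that the regression reads $\dot{\mathbf{A}} = \boldsymbol{\Theta}(\mathbf{A})\boldsymbol{\Xi}$, the quadratic library, the threshold, and the damped-oscillator test data are hypothetical, and the derivatives are taken from the known model rather than estimated from data.

```python
import numpy as np

def poly_library(A):
    """Candidate terms [1, a1, a2, a1^2, a1*a2, a2^2]; A holds snapshots as rows."""
    a1, a2 = A[:, 0], A[:, 1]
    return np.column_stack([np.ones_like(a1), a1, a2, a1**2, a1 * a2, a2**2])

def stlsq(Theta, dA, threshold, n_iter=10):
    """Sequential thresholded least squares for dA ~ Theta @ Xi."""
    Xi = np.linalg.lstsq(Theta, dA, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0                       # prune negligible coefficients
        for k in range(dA.shape[1]):          # refit each state on active terms
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dA[:, k], rcond=None)[0]
    return Xi

# Toy coefficient dynamics da/dt = L a with a known analytic trajectory
L = np.array([[-0.1, 2.0],
              [-2.0, -0.1]])
t = np.linspace(0.0, 10.0, 500)
A = np.column_stack([2.0 * np.exp(-0.1 * t) * np.cos(2.0 * t),
                     -2.0 * np.exp(-0.1 * t) * np.sin(2.0 * t)])
dA = A @ L.T                                  # exact derivatives for this toy model

Xi = stlsq(poly_library(A), dA, threshold=0.05)
```

Column `k` of `Xi` lists the coefficients of the library terms in the equation for the `k`th coefficient; here only the two linear terms per equation should remain active.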

Applying SINDy to POD mode coefficients provides a simple regression
framework for discovering a parsimonious, and generally nonlinear, model for
the evolution dynamics of the high-dimensional system in a low-dimensional
subspace. This approach, called sparse Galerkin regression [455], was illustrated
in Fig. 7.6 in Section 7.3. There, the canonical example of flow
past a circular cylinder was considered. This is modeled by the two-dimensional,
incompressible Navier-Stokes equations:


$$\nabla\cdot\mathbf{u} = 0, \qquad \partial_t\mathbf{u} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\nabla p + \frac{1}{Re}\Delta\mathbf{u}, \tag{12.93}$$

where $\mathbf{u}$ is the two-component flow velocity field in 2D and $p$ is the pressure
term. For Reynolds number $Re = Re_c \approx 47$, the fluid flow past a cylinder
undergoes a supercritical Hopf bifurcation, where the steady flow for $Re < Re_c$
transitions to unsteady vortex shedding [60]. The unfolding gives the celebrated
Stuart-Landau ODE, which is essentially the Hopf normal form in
complex coordinates. This has resulted in accurate and efficient reduced-order
models for this system [524,526].

In Fig. 7.6, simulations at $Re = 100$ were considered. The snapshots of the
evolution dynamics can be collected as in (12.16). Noack et al. showed
that the first two SVD modes and a third, orthogonal shift mode capture the
essential dynamics of this flow. In these coordinates, the discovered dynamical
model is given by


$$\dot{a}_1 = \mu a_1 - \omega a_2 + A a_1 a_3, \tag{12.94a}$$
$$\dot{a}_2 = \omega a_1 + \mu a_2 + A a_2 a_3, \tag{12.94b}$$
$$\dot{a}_3 = -\lambda(a_3 - a_1^2 - a_2^2), \tag{12.94c}$$


which is the same as was found by Noack et al. through a detailed asymptotic
reduction of the flow dynamics. Thus the ROM evolution dynamics
provide a non-intrusive, purely data-driven path to discover models similar to
those achieved via Galerkin-POD projection. Not only is this model stable, but
it also captures the correct supercritical Hopf bifurcation dynamics as a function
of Reynolds number. Loiseau et al. also showed that it is possible
to incorporate partially known physics, such as the energy-preserving skew-symmetry
of the quadratic terms, as constraints in the SINDy regression.


Homework


Exercise 12-1. Using the flow around a cylinder data, compute the singular 
value spectrum and POD modes using the standard method of snapshots (us- 
ing the trapezoidal rule when discretizing the continuous formulation) and 
compare it against the modes and spectrum when using Simpson’s rule and 
Boole’s rule for integration. Quantify the difference in modes using least-squares 
error. 


Exercise 12-2. Generate high-fidelity, well-resolved solutions for the Kuramoto- 
Sivashinsky (KS) equation in a parameter regime where spatio-temporal chaos 
is exhibited. Build a number of reduced-order models using the high-fidelity 
data and test the models for future state prediction. 


(a) Compute the leading POD modes and produce a rank-r Galerkin-POD
approximation of the PDE evolution.


(b) Compute a ROM by using the snapshots to produce a rank-r DMD model 
to characterize the evolution. 


(c) Compute a POD-DMD model where the nonlinear terms are approxi- 
mated using a DMD model. 


(d) Learn both a feedforward and LSTM network for advancing the solution 
in time in a rank-r POD basis. 


Compare the various architectures in terms of their stability and future state 
prediction capabilities. Investigate the dynamics as a function of the rank re- 
duction parameter r. Repeat the experiments by adding noise to the high-fidelity 
simulation data. Repeat the experiments yet again with both noise and added 
outliers (corruption) to the high-fidelity simulation data. 


Exercise 12-3. Learn a deep neural network autoencoder to build a linear model 
for flow around a cylinder. Use high-fidelity flow around a cylinder data to 
learn a coordinate (autoencoder) transformation to an (r = 3)-dimensional sub- 
space where the dynamics is linear and a Koopman operator can be constructed 
[465]. In the new linear coordinates, compute the eigenvalues and eigenvectors 
of the latent state representation. Use the model to forecast the future state and 
compare with the high-fidelity simulations. 


Exercise 12-4. Learn a deep neural network autoencoder to build a parsimonious,
but nonlinear, model for flow around a cylinder. Use high-fidelity flow
around a cylinder data to learn a coordinate (autoencoder) transformation to an
(r = 3)-dimensional subspace where the dynamics is given by a parsimonious
dynamical system [168]. In the new linear coordinates, compute the eigenvalues
and eigenvectors of the latent state representation. Use the model to forecast
the future state and compare with the high-fidelity simulations.


Chapter 13


Interpolation for Parametric Reduced-Order Models

In the last chapter, the mathematical framework of ROMs was outlined. Specifically,
Chapter 12 has already highlighted the POD method for projecting PDE
dynamics to low-rank subspaces where simulations of the governing PDE model
can be more readily evaluated. However, the complexity of projecting into the
low-rank approximation subspace remains challenging due to the nonlinearity.
Interpolation in combination with POD overcomes this difficulty by providing
a computationally efficient method for discretely (sparsely) sampling
and evaluating the nonlinearity. This chapter leverages the ideas of the sparse
and compressive sampling algorithms of Chapter 3, where a small number of
samples are capable of reconstructing the low-rank dynamics of PDEs. Ultimately,
these methods ensure that the computational complexity of ROMs scales
favorably with the rank of the approximation, even for complex nonlinearities.
The primary focus of this chapter is to highlight sparse interpolation methods
that enable a rapid and low-dimensional construction of the ROMs. In practice,
these techniques dominate the ROM community since they are critically
enabling for evaluating parametrically dependent PDEs where frequent ROM
model updates are required.


13.1 Gappy POD 


The success of nonlinear model order reduction is largely dependent upon two
key innovations: (i) the well-known Galerkin-POD method [738],
which is used to project the high-dimensional nonlinear dynamics onto a low-dimensional
subspace in a principled way; and (ii) sparse sampling of the state
space for interpolating the nonlinear terms required for the subspace projection.
Thus sparsity is already established as a critically enabling mathematical
framework for model reduction through methods such as gappy POD and
its variants [767]. Indeed, efficiently managing the computation
of the nonlinearity was recognized early on in the ROMs community,
and a variety of techniques were proposed to accomplish this task. Perhaps the
first innovation in sparse sampling with POD modes was the technique proposed
by Everson and Sirovich, from which the gappy POD moniker was derived
[239]. In their sparse sampling scheme, random measurements were used to
approximate inner products. Principled selection of the interpolation points,
through the gappy POD infrastructure or missing point
(best points) estimation (MPE) [29], [522], was quickly incorporated into ROMs
to improve performance. More recently, the empirical interpolation method
(EIM) and its most successful variant, the POD-tailored discrete empirical
interpolation method (DEIM) [171], have provided a greedy algorithm that
allows for nearly optimal reconstructions of nonlinear terms of the original
high-dimensional system. The DEIM approach combines projection with interpolation.
Specifically, DEIM uses selected interpolation indices to specify an
interpolation-based projection for a nearly optimal $\ell_2$ subspace approximating
the nonlinearity.

The low-rank approximation provided by POD allows for a reconstruction
of the solution $\mathbf{u}(x,t)$ in (13.9) with $r$ measurements of the $n$-dimensional state.
This viewpoint has profound consequences on how we might consider measuring
our dynamical system [239]. In particular, only $r < n$ measurements
are required for reconstruction, allowing us to define the sparse representation
variable $\tilde{\mathbf{u}} \in \mathbb{C}^r$,


$$\tilde{\mathbf{u}} = \mathbf{P}\mathbf{u}, \tag{13.1}$$


where the measurement matrix $\mathbf{P} \in \mathbb{R}^{r \times n}$ specifies $r$ measurement locations of
the full state $\mathbf{u} \in \mathbb{C}^n$. As an example, the measurement matrix might take the
form


$$\mathbf{P} = \begin{bmatrix} 1 & 0 & \cdots & & & & \cdots & 0 \\ 0 & \cdots & 0 & 1 & 0 & \cdots & \cdots & 0 \\ 0 & \cdots & \cdots & 0 & 1 & 0 & \cdots & 0 \\ \vdots & 0 & \cdots & 0 & 0 & 1 & \cdots & \vdots \\ 0 & \cdots & \cdots & 0 & 0 & 0 & \cdots & 1 \end{bmatrix}, \tag{13.2}$$


where measurement locations take on the value of unity and the matrix elements
are zero elsewhere. The matrix $\mathbf{P}$ defines a projection onto an $r$-dimensional
space $\tilde{\mathbf{u}}$ that can be used to approximate solutions of a PDE.
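
For concreteness, a measurement matrix of this type can be assembled from a list of sensor indices, as in the short sketch below. The function name, the state vector, and the chosen indices are hypothetical and serve only to illustrate (13.1).

```python
import numpy as np

def measurement_matrix(indices, n):
    """Build the r x n matrix P with a single 1 per row at the chosen locations."""
    P = np.zeros((len(indices), n))
    P[np.arange(len(indices)), indices] = 1.0
    return P

n = 8
indices = np.array([0, 3, 5])            # hypothetical measurement locations
u = np.arange(n, dtype=float) ** 2       # hypothetical full state
P = measurement_matrix(indices, n)
u_tilde = P @ u                          # sparse measurements, as in (13.1)

# Indexing the state directly gives the same result without forming P
assert np.array_equal(u_tilde, u[indices])
```

In practice the matrix is rarely formed explicitly, since selecting entries of the state by index is equivalent and cheaper.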

This insight and observation forms the basis of the gappy POD
method introduced by Everson and Sirovich [239]. In particular, one can use
a small number of measurements, or gappy data, to reconstruct the full state of
the system. In doing so, we can overcome the complexity of evaluating higher-order
nonlinear terms in the POD reduction.


Sparse Measurements and Reconstruction


The measurement matrix $\mathbf{P}$ allows for an approximation of the state vector $\mathbf{u}$
from $r$ measurements. The approximation is obtained by using (13.1) with the
standard POD projection:


$$\tilde{\mathbf{u}} \approx \mathbf{P}\sum_{k=1}^{r}\tilde{a}_k\boldsymbol{\psi}_k, \tag{13.3}$$


where the coefficients $\tilde{a}_k$ minimize the error in approximation: $\|\tilde{\mathbf{u}} - \mathbf{P}\mathbf{u}\|$. The
challenge now is how to determine the $\tilde{a}_k$ given that inner products over the
full domain can no longer be performed. Specifically, the vector $\tilde{\mathbf{u}}$ has dimension $r$
whereas the POD modes have dimension $n$, i.e., the inner product requires
information from the full range of $x$, the underlying discretized spatial variable,
which is of length $n$. Thus, the modes $\boldsymbol{\psi}_k(x)$ are in general not orthogonal over
the $r$-dimensional support of $\tilde{\mathbf{u}}$. The support will be denoted as $s[\tilde{\mathbf{u}}]$. More
precisely, orthogonality must be considered on the full range versus the support
space. Thus the following two relationships hold:


$$M_{kj} = \langle\boldsymbol{\psi}_k, \boldsymbol{\psi}_j\rangle = \delta_{kj}, \tag{13.4a}$$
$$M_{kj} = \langle\boldsymbol{\psi}_k, \boldsymbol{\psi}_j\rangle_{s[\tilde{\mathbf{u}}]} \neq 0 \quad \text{for all } k, j, \tag{13.4b}$$


where $M_{kj}$ are the entries of the Hermitian matrix $\mathbf{M}$ and $\delta_{kj}$ is the Kronecker
delta function. The fact that the POD modes are not orthogonal on the support
$s[\tilde{\mathbf{u}}]$ leads us to consider alternatives for evaluating the vector $\tilde{\mathbf{a}}$.

To determine the $\tilde{a}_k$, a least-squares algorithm can be used to minimize the
error,


$$E = \int_{s[\tilde{\mathbf{u}}]}\left[\tilde{\mathbf{u}} - \sum_{k=1}^{r}\tilde{a}_k\boldsymbol{\psi}_k\right]^2 dx, \tag{13.5}$$


where the inner product is evaluated on the support $s[\tilde{\mathbf{u}}]$, thus making the two
terms in the integral of the same size $r$. The minimizing solution to (13.5) requires
the residual to be orthogonal to each mode $\boldsymbol{\psi}_j$, so that


$$\left\langle\tilde{\mathbf{u}} - \sum_{k=1}^{r}\tilde{a}_k\boldsymbol{\psi}_k, \boldsymbol{\psi}_j\right\rangle_{s[\tilde{\mathbf{u}}]} = 0 \quad \text{for } j = 1, 2, \ldots, r. \tag{13.6}$$


In practice, we can project the full-state vector $\mathbf{u}$ onto the support space and
determine the vector $\tilde{\mathbf{a}}$:

$$\mathbf{M}\tilde{\mathbf{a}} = \mathbf{f}, \tag{13.7}$$


where the elements of $\mathbf{M}$ are given by (13.4b) and the components of the vector
$\mathbf{f}$ are given by


$$f_k = \langle\mathbf{u}, \boldsymbol{\psi}_k\rangle_{s[\tilde{\mathbf{u}}]}. \tag{13.8}$$

Note that if the measurement space is sufficiently dense, or if the support space
is the entire space, then $\mathbf{M} \to \mathbf{I}$, implying that the eigenvalues of $\mathbf{M}$ approach
unity as the number of measurements becomes dense. Once the vector $\tilde{\mathbf{a}}$ is determined,
a reconstruction of the solution can be performed as


$$\mathbf{u}(x,t) \approx \boldsymbol{\Psi}\tilde{\mathbf{a}}. \tag{13.9}$$


As the measurements become dense, not only does the matrix $\mathbf{M}$ converge to
the identity, but also $\tilde{\mathbf{a}} \to \mathbf{a}$. Interestingly, these observations lead us to consider
the efficacy of the method and/or approximation by considering the condition
number of the matrix $\mathbf{M}$ [711]:


$$\kappa(\mathbf{M}) = \|\mathbf{M}\|\,\|\mathbf{M}^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}. \tag{13.10}$$

Here the 2-norm has been used, so that $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest
singular values of $\mathbf{M}$. If $\kappa(\mathbf{M})$ is small, then the matrix is said to be
well conditioned. A minimal value of $\kappa(\mathbf{M})$ is achieved with the identity matrix
$\mathbf{M} = \mathbf{I}$. Thus, as the sampling space becomes dense, the condition number also
approaches unity. This can be used as a metric for determining how well the
sparse sampling is performing. Large condition numbers suggest poor reconstruction,
while values tending toward unity should perform well.
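
The reconstruction procedure in (13.4b) and (13.7)-(13.10) can be sketched as follows. This is a minimal illustration under stated assumptions: inner products on the support are taken as plain sums over the sampled grid points, and the orthonormal polynomial basis, the sensor locations, and the function name `gappy_reconstruct` are hypothetical.

```python
import numpy as np

def gappy_reconstruct(Psi, u, idx):
    """Reconstruct a state from its entries at indices idx.

    Psi is n x r with modes as columns; only u[idx] is actually used.
    Returns the reconstruction Psi @ a_tilde and the condition number of M.
    """
    Psi_s = Psi[idx, :]                  # modes restricted to the support
    M = Psi_s.conj().T @ Psi_s           # M_kj = <psi_k, psi_j> on the support
    f = Psi_s.conj().T @ u[idx]          # f_k = <u, psi_k> on the support
    a_tilde = np.linalg.solve(M, f)      # solve M a_tilde = f, as in (13.7)
    return Psi @ a_tilde, np.linalg.cond(M)

# Hypothetical orthonormal basis: QR factorization of low-order polynomials
x = np.linspace(0.0, 1.0, 50)
Psi, _ = np.linalg.qr(np.column_stack([np.ones_like(x), x, x**2]))

u = Psi @ np.array([1.0, -2.0, 0.5])     # a state lying in the span of the modes
idx = np.arange(0, 50, 7)                # eight evenly spaced sensor locations

u_rec, kappa = gappy_reconstruct(Psi, u, idx)
```

Because the test state lies exactly in the span of the modes, the eight measurements recover it up to round-off; for a general state the returned condition number indicates how strongly measurement errors may be amplified.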


Harmonic Oscillator Modes 


To demonstrate the gappy sampling method and its reconstruction efficacy, we 
apply the technique using the first 10 modes of the Gauss—Hermite functions 
defined by and (12.26). To compute the second derivative, we use the 
fact that the Fourier transform F can produce a spectrally accurate approxima- 
tion, i.e., Ur, = F~'[(ik)?Fu]. For the sake of producing accurate derivatives, 
we consider the domain x € [—10, 10] but then work with the smaller domain 
of interest x € [—4, 4]. Recall further that the Fourier transform assumes a 27- 
periodic domain. This is handled by a scaling factor in the k wavevectors. The 
first five modes have been demonstrated in Fig. 

The mode construction is shown in the top panel of Fig. Each colored 
cell represents the discrete value of the mode in the interval x € [—4,4] with 
Ax = 0.1. Thus there are 81 discrete values for each of the modes w,. Our 
objective is to reconstruct a function outside of the basis modes of the harmonic 
oscillator. In particular, consider the function 


f(x) = exp[—(z — 0.5)?] + 3exp[—2(2 + 3/2)7), (13.11) 


which will be discretized and defined over the same domain as the modal basis 
of the harmonic oscillator. We construct this function and further numerically 
construct the projection of the function onto the basis functions w,,. The original 
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Figure 13.1: The top panel shows the first nine modes of the quantum har- 
monic oscillator considered in and (12.26). Three randomly generated 
measurement matrices, P;, with j = 1, 2, and 3, are depicted. There is a 20% 
chance of performing a measurement at a given spatial location x, in the inter- 
val x € |—4, 4] with a spacing of Ax = 0.1. 


function is plotted in the top panel of Fig. 13.2. Note that the goal now is to
reconstruct this function both with a low-rank projection onto the harmonic
oscillator modes, and with a gappy reconstruction whereby only a sampling of
the data is used, via the measurements $P_j$. The test function is reconstructed
in the 10-mode harmonic oscillator basis. Further, the matrix M is built for the
full-state measurements and its condition number is computed.
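A minimal sketch of this full-measurement step is given below. It is written under stated assumptions: the modes are built from the analytic Gauss-Hermite formula rather than computed numerically, they are scaled by $\sqrt{\Delta x}$ so that the discrete inner product approximates the continuous one, and the helper name `hermite_modes` is hypothetical.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_modes(x, r):
    """Return a (len(x), r) matrix whose columns are Gauss-Hermite functions."""
    dx = x[1] - x[0]
    Psi = np.zeros((x.size, r))
    for n in range(r):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0  # selects the physicists' Hermite polynomial H_n
        norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
        Psi[:, n] = norm * hermval(x, coeffs) * np.exp(-x**2 / 2)
    return Psi * np.sqrt(dx)  # scaling so that Psi.T @ Psi is close to I

x = np.linspace(-4, 4, 81)                  # dx = 0.1, 81 grid points
Psi = hermite_modes(x, 10)                  # 10-mode basis
f = np.exp(-(x - 0.5) ** 2) + 3 * np.exp(-2 * (x + 1.5) ** 2)  # test function (13.11)

M = Psi.T @ Psi                             # full-measurement matrix, approximately I
a = np.linalg.solve(M, Psi.T @ f)           # least-squares modal coefficients
f_lowrank = Psi @ a                         # 10-mode reconstruction

cond_full = np.linalg.cond(M)
err = np.linalg.norm(f - f_lowrank)
print("cond(M) =", cond_full, " least-squares error =", err)
```

The residual error is not expected to vanish, since the test function lies outside the span of the 10 modes.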

Results of the low-rank and gappy reconstruction are shown in Fig. 13.2.
The low-rank reconstruction is performed using the full measurements
projected to the 10 leading harmonic oscillator modes. In this case, the inner
product of the measurement matrix is given by (13.4h) and is approximately the
identity. The fact that we are working on a limited domain $x \in [-4, 4]$ with a
discretization step of $\Delta x = 0.1$ is what makes $M \approx I$ versus being
exactly the identity. For the three different sparse measurement scenarios $P_j$
of Fig. 13.1, the reconstruction is also shown along with the least-squares error
and the logarithm of the condition number $\log(\kappa(M_j))$. We also visualize
the three matrices $M_j$ in Fig. 13.3. The condition number of each of these
matrices helps determine its reconstruction accuracy.
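The gappy reconstruction for one random measurement mask can be sketched as follows. The assumptions are the same as for any illustration here: analytic Gauss-Hermite modes via a hypothetical helper `hermite_modes`, a fixed random seed, and `numpy.linalg.lstsq` to solve for the coefficients so that a rank-deficient M does not raise an error.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_modes(x, r):
    """Return a (len(x), r) matrix whose columns are Gauss-Hermite functions."""
    dx = x[1] - x[0]
    Psi = np.zeros((x.size, r))
    for n in range(r):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0  # selects the physicists' Hermite polynomial H_n
        norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
        Psi[:, n] = norm * hermval(x, coeffs) * np.exp(-x**2 / 2)
    return Psi * np.sqrt(dx)

x = np.linspace(-4, 4, 81)
Psi = hermite_modes(x, 10)
f = np.exp(-(x - 0.5) ** 2) + 3 * np.exp(-2 * (x + 1.5) ** 2)

rng = np.random.default_rng(0)              # assumed seed, for repeatability only
mask = rng.random(x.size) < 0.2             # each grid point measured with 20% chance

Psi_s = Psi[mask]                           # modes restricted to the support space
M = Psi_s.T @ Psi_s                         # generally far from the identity
a_tilde = np.linalg.lstsq(M, Psi_s.T @ f[mask], rcond=None)[0]
f_gappy = Psi @ a_tilde                     # reconstruction on the full grid

kappa = np.linalg.cond(M)
err = np.linalg.norm(f - f_gappy)
print("sensors:", int(mask.sum()), " log cond(M):", np.log(kappa), " error:", err)
```

Only the measured entries `f[mask]` enter the solve; the error is evaluated against the full function purely for diagnostics.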

Figure 13.2: (top) The original function (black) along with a 10-mode
reconstruction of the test function
$f(x) = \exp[-(x - 0.5)^2] + 3\exp[-2(x + 3/2)^2]$ sampled in the full space
((a), red) and three representative support spaces s[t] of Fig. 13.1,
specifically (b) $P_1$, (c) $P_2$, and (d) $P_3$. Note that the error measurement
is specific to the function being considered, whereas the condition number
metric is independent of the specific function. Although both can serve as
proxies for performance, the condition number serves for any function, which is
advantageous.


13.2 Error and Convergence of Gappy POD 


As was shown in the previous section, the ability of the gappy sampling strat- 
egy to accurately reconstruct a given function depends critically on the place- 
ment of the measurement (sensor) locations. Given the importance of this issue, 
we will discuss a variety of principled methods for placing a limited number 
of sensors in detail in subsequent sections. Our goal in this section is to investi- 
gate the convergence properties and error associated with the gappy method as 
a function of the percentage of sampling of the full system. Random sampling 
locations will be used. 

Given our random sampling strategy, the results that follow will be
statistical in nature, computing averages and variances for batches of randomly
selected samples. The modal basis for our numerical experiments is again
the Gauss-Hermite functions defined by and (12.26), and shown in the
top panel of Fig. 13.1.

Figure 13.3: Demonstration of the deterioration of the orthogonality of the
modal basis in the support space s[t] as given by the matrix M defined in (13.4).
The top left shows that the identity matrix is produced for full measurements,
or nearly so but with errors due to truncation of the domain over $x \in [-4, 4]$.
The matrices $M_j$, which no longer look diagonal, correspond to the sparse
sampling matrices $P_j$ in Fig. 13.1. Thus it is clear that the modes are not
orthogonal in the support space of the measurements.


Random Sampling and Convergence 


Our study begins with random sampling of the modes at a level of 10%, 20%,
30%, 40%, 50%, and 100%, respectively. The latter case represents the idealized
full sampling of the system. As one would expect, the error and reconstruction
are improved as more samples are taken. To show the convergence of the gappy
sampling, we consider two error metrics: (i) the $\ell_2$ error between our
randomly subsampled reconstruction and the test function (13.11), and (ii) the
condition number of the matrix M for a given measurement matrix $P_j$. Recall
that the condition number provides a way to measure the error without knowing
the truth, i.e., (13.11).
Figure 13.4 depicts the average over 1000 trials of the logarithm of the
least-squares error, $\log(E+1)$ (unity is added to avoid negative numbers), and
the log of the condition number, $\log(\kappa(M))$, as a function of percentage
of random measurements. Also depicted is the variance $\sigma$, with the red
bars denoting $\mu \pm \sigma$, where $\mu$ is the average value. The error and
condition number both perform
better as the number of samples increases. Note that the error does not ap- 

Figure 13.4: Logarithm of the least-squares error, $\log(E+1)$ (unity is added to
avoid negative numbers), and the log of the condition number, $\log(\kappa(M))$,
as a function of percentage of random measurements. For 10% measurements, the
error and condition number are largest, as expected. However, the variance of
the results, depicted by the red bars, is also quite large, suggesting that the
performance for a small number of sensors is highly sensitive to their placement.


proach zero since only a 10-mode basis expansion is used, thus limiting the ac- 
curacy of the POD expansion and reconstruction even with full measurements. 
We draw 1000 random sensor configurations (see Fig. ) at each of the 10%,
20%, 30%, 40%, and 50% sampling levels. The full reconstruction (100% sampling)
is used to make the final graphic for Fig. 13.4. Note that, as expected, the
error and condition number trends are similar, thus supporting the hypothesis
that the condition number can be used to evaluate the efficacy of the sparse
measurements. Indeed, this clearly shows that the condition number provides an
evaluation that does not require knowledge of the function in (13.11).
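A hedged sketch of this statistical experiment follows. Several choices are assumptions made for the illustration: analytic modes from a hypothetical `hermite_modes` helper, natural logarithms, a least-squares solve that tolerates rank-deficient M, and a cap of $10^{20}$ on the condition number so that its logarithm stays finite when fewer than 10 sensors are drawn.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_modes(x, r):
    """Return a (len(x), r) matrix whose columns are Gauss-Hermite functions."""
    dx = x[1] - x[0]
    Psi = np.zeros((x.size, r))
    for n in range(r):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0
        norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
        Psi[:, n] = norm * hermval(x, coeffs) * np.exp(-x**2 / 2)
    return Psi * np.sqrt(dx)

def gappy_trial(Psi, f, fraction, rng):
    """One random mask: return (log(E + 1), log(cond(M)))."""
    mask = rng.random(Psi.shape[0]) < fraction
    Psi_s = Psi[mask]
    M = Psi_s.T @ Psi_s
    a = np.linalg.lstsq(M, Psi_s.T @ f[mask], rcond=None)[0]
    err = np.linalg.norm(f - Psi @ a)
    kappa = min(np.linalg.cond(M), 1e20)    # assumed cap for singular M
    return np.log(err + 1.0), np.log(kappa)

x = np.linspace(-4, 4, 81)
Psi = hermite_modes(x, 10)
f = np.exp(-(x - 0.5) ** 2) + 3 * np.exp(-2 * (x + 1.5) ** 2)

rng = np.random.default_rng(0)
n_trials = 1000
stats = {}
for frac in [0.1, 0.2, 0.3, 0.4, 0.5]:
    out = np.array([gappy_trial(Psi, f, frac, rng) for _ in range(n_trials)])
    stats[frac] = (out.mean(axis=0), out.std(axis=0))  # (mean, spread) of both metrics
    print(frac, "mean:", stats[frac][0], "std:", stats[frac][1])
```

Each entry of `stats` holds the mean and standard deviation of the two metrics, which is the information summarized by the bars in a plot of this kind.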


Gappy Measurements and Performance 


We can continue this statistical analysis of the gappy reconstruction method by
looking more carefully at 200 random trials of 20% measurements. Figure 13.5
shows three key features of the 200 random trials. In particular, as shown in the
top panel of this figure, there is a large variance in the distribution of the
condition number $\kappa(M)$ for 20% sampling. Specifically, the condition number
can change by orders of magnitude with the same number of sensors, but simply

Figure 13.5: Statistics of 20% random measurements considered in Fig. 13.4.
Panel (a) depicts 200 random trials and the condition number $\log(\kappa(M))$ of
each trial. Histograms of (b) the logarithm of the least-squares error,
$\log(E + 1)$, and (c) the condition number, $\log(\kappa(M))$, are also depicted
for the 200 trials. The panels illustrate the extremely high variability
generated from the random, sparse measurements. In particular, 20% measurements
can produce both exceptional results and extremely poor performance depending
upon the measurement locations. The measurement vectors P that generate these
statistics are depicted in Fig. 13.6.


placed in different locations. A histogram of the distribution of the log error 
log(E + 1) and the log of the condition number are shown in the bottom two 
panels. The error appears to be distributed in an exponentially decaying fash- 
ion whereas the condition number distribution is closer to a Gaussian. There are 
distinct outliers whose errors and condition numbers are exceptionally high, 
suggesting sensor configurations to be avoided. 

In order to visualize the random, gappy measurements of the 200 samples
used in the statistical analysis of Fig. 13.5, we plot the $P_j$ measurement
masks in each row of the matrix in Fig. 13.6. The white regions represent regions
where no measurements occur. The black regions are where the measurements are
taken. These are the measurements that generate the orders-of-magnitude
variance in the error and condition number.

As a final analysis, we can sift through the 200 random measurements of
Fig. 13.6 and pick out both the 10 best and 10 worst measurement vectors $P_j$.
Figure 13.7 shows the results of this sifting process. The top two panels depict

Figure 13.6: Depiction of the 200 random 20% measurement vectors $P_j$
considered in Fig. 13.5. Each row is a randomly generated measurement trial (from
1 to 200) while the columns represent their spatial location on the domain
$x \in [-4, 4]$ with $\Delta x = 0.1$.


the best and worst measurement configurations. Interestingly, the worst mea- 
surements have long stretches of missing measurements near the center of the 
domain where much of the modal variance occurs. In contrast, the best mea- 
surements have well-sampled domains with few long gaps between measure- 
ment locations. The bottom panel shows that the best measurements (on the 
left) offer an improvement of two orders of magnitude in the condition num- 
ber over the poor-performing counterparts (on the right). 


13.3 Gappy Measurements: Minimize Condition Number


The preceding section illustrates that the placement of gappy measurements 
is critical for accurately reconstructing the POD solution. This suggests that a 
principled way to determine measurement locations is of great importance. In 

Figure 13.7: Depiction of the 10 best and 10 worst random 20% measurement
vectors $P_j$ considered in Figs. 13.5 and 13.6. The top panel shows that the best
measurement vectors sample fairly uniformly across the domain $x \in [-4, 4]$
with $\Delta x = 0.1$. In contrast, the worst randomly generated measurements
(middle panel) have large sampling gaps near the center of the domain, leading to
a large condition number $\kappa(M)$. The bottom panel shows a bar chart of the
best and worst values of the condition number. Note that with 20% sampling,
there can be two orders of magnitude difference in the condition number, thus
suggesting the importance of prescribing good measurement locations.


what follows, we outline a method originally proposed by Willcox [754] for
assessing the gappy measurement locations. The method is based on minimizing
the condition number $\kappa(M)$ in the placement process. As already shown, the
condition number is a good proxy for evaluating the efficacy of the
reconstruction. Moreover, it is a measure that is independent of any specific
function.

The algorithm proposed is computationally costly, but it can be per- 
formed in an offline training stage. Once the sensor locations are determined, 
they can be used for online reconstruction. The algorithm is as follows: 


1. Place sensor k at each spatial location possible and evaluate the condition
number $\kappa(M)$. Only points not already containing a sensor are
considered.


2. Determine the spatial location that minimizes the condition number $\kappa(M)$.
This spatial location is now the kth sensor location.


3. Add sensor k + 1 and repeat the previous two steps.


Figure 13.8: Depiction of the first four iterations of the gappy measurement
location algorithm of Willcox [754]. The algorithm is applied to a 10-mode
expansion given by the Gauss-Hermite functions and discretized
on the interval $x \in [-4, 4]$ with $\Delta x = 0.1$. The top panel shows the
condition number $\kappa(M)$ as a single sensor is considered at each of the 81
discrete values $x_k$. The first sensor minimizes the condition number (shown in
red) at $x_{23}$. A second sensor is now considered at all remaining 80 spatial
locations, with the minimal condition number occurring at $x_{52}$ (in red).
Repeating this process gives $x_{37}$ and $x_{77}$ for the third and fourth sensor
locations for iterations 3 and 4 of the algorithm (highlighted in red). Once a
location is selected for a sensor, it is no longer considered in future
iterations. This is represented by a gap.


The algorithm is greedy and is not guaranteed to be optimal. However, it works
quite well in practice since sensor configurations with low condition number
produce good reconstructions with the POD modes.

We apply this algorithm to construct the gappy measurement matrix P. As
before, the modal basis for our numerical experiments is the Gauss-Hermite
functions defined by and (12.26). The gappy measurement matrix
algorithm for constructing P is shown in Fig. 13.8, specifically the first four
iterations of the scheme. Note that the algorithm outlined above sets down one
sensor at a time; thus, with the 10-POD-mode expansion, the system is
under-determined until 10 sensors are placed. This gives condition numbers on the
order of $10^{16}$ for the first nine sensor placements. It also suggests that the
first 10 sensor locations may be generated from inaccurate calculations of the
condition number.

Using a 10-mode expansion of the Gauss-Hermite functions, we minimize
the condition number and identify the first 20 sensor locations. Specifically, this
provides a principled way of producing a measurement matrix P that allows
for good reconstruction of the POD mode expansion with limited measurements.
In addition to identifying the placement of the first 20 sensors, reconstruction
of the example function given by (13.11) is computed at each iteration of the
routine. Note the use of the setdiff command, which removes the
condition number minimizing sensor location from consideration in the next
iteration.
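The three steps of the placement algorithm can be sketched in Python as below. This is an illustrative implementation under assumptions, not the original listing: modes come from a hypothetical `hermite_modes` helper based on the analytic formula, `numpy.setdiff1d` plays the role of setdiff, and indices are zero-based, so the selected locations need not coincide with those quoted in the text.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_modes(x, r):
    """Return a (len(x), r) matrix whose columns are Gauss-Hermite functions."""
    dx = x[1] - x[0]
    Psi = np.zeros((x.size, r))
    for n in range(r):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0
        norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
        Psi[:, n] = norm * hermval(x, coeffs) * np.exp(-x**2 / 2)
    return Psi * np.sqrt(dx)

def greedy_condition_sensors(Psi, n_sensors):
    """Place sensors one at a time, each minimizing cond(M) over free locations."""
    chosen, history = [], []
    remaining = np.arange(Psi.shape[0])
    for _ in range(n_sensors):
        kappas = np.empty(remaining.size)
        for i, j in enumerate(remaining):
            rows = Psi[chosen + [j]]            # trial configuration with sensor at j
            kappas[i] = np.linalg.cond(rows.T @ rows)
        best = int(remaining[np.argmin(kappas)])
        chosen.append(best)
        history.append(float(kappas.min()))
        remaining = np.setdiff1d(remaining, [best])  # location is no longer available
    return chosen, history

x = np.linspace(-4, 4, 81)
Psi = hermite_modes(x, 10)
sensors, kappas = greedy_condition_sensors(Psi, 20)
print("sensor indices (zero-based):", sensors)
print("log10 cond(M) per iteration:", np.log10(kappas))
```

Until 10 sensors are placed, M is rank deficient and the computed condition numbers are dominated by round-off, so the first selections are effectively arbitrary in this sketch as well.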

To evaluate the gappy sensor location algorithm, we track the condition
number as a function of the number of iterations, up to 20 sensors. Additionally,
at each iteration, a reconstruction of the test function (13.11) is computed and a
least-squares error evaluated. Figure 13.9 shows the progress of the algorithm
as it evaluates the sensor locations for up to 20 sensors. By construction, the
algorithm minimizes the condition number $\kappa(M)$ at each step of the iteration;
thus, as sensors are added, the condition number steadily decreases (top panel
of Fig. 13.9). Note that there is a significant decrease in the condition number
once 10 sensors are selected, since the system is no longer under-determined
with theoretically infinite condition number. The least-squares error for the
reconstruction of the test function (13.11) follows the same general trend, but
the error does not monotonically decrease like the condition number. The
least-squares error also makes a significant improvement once 10 measurements are
made. In general, if an r-mode POD expansion is to be considered, then reasonable
results using the gappy reconstruction cannot be achieved until r sensors
are placed.

We now consider the placement of the sensors as a function of iteration in
the bottom panel of Fig. 13.9. Specifically, we depict when sensors are
identified in the iteration. The first sensor location is $x_{23}$ followed by
$x_{52}$, $x_{37}$, and $x_{77}$, respectively. The process is continued until the
first 20 sensors are identified. The pattern of sensors depicted is important, as
it illustrates a fairly uniform sampling of the domain. Alternative schemes will
be considered in the following.

As a final illustration of the gappy algorithm, we consider the reconstruction
of the test function (13.11) as the number of iterations (sensors) increases.
As expected, the more sensors that are used in the gappy framework, the better
the reconstruction is, especially if the sensors are placed in a principled way as
outlined by Willcox [754]. Figure 13.10 shows the reconstructed function with
increasing iteration number. In the left panel, iterations 1-20 are shown with
the z-axis set to illustrate the extremely poor reconstruction in the early stages

Figure 13.9: Condition number and least-squares error (logarithms) as a
function of the number of iterations in the gappy sensor placement algorithm. The
log of the condition number $\log(\kappa(M))$ monotonically decreases, since this is
being minimized at each iteration step. The log of the least-squares error in the
reconstruction of the test function (13.11) also shows a trend towards
improvement as the number of sensors is increased. Once 10 sensors are placed, the
system is of full rank and the condition number drops by orders of magnitude.
The bottom panel shows the sensors as they turn on (black squares) over the
first 20 iterations. The first measurement location is, for instance, at $x_{23}$.


of the iteration. The right panel highlights the reconstruction from iteration 9 
to 20, and on a more limited z-axis scale, where the reconstruction converges to 
the test function. The true test function is also shown in order to visualize the 
comparison. This illustrates in a tangible way the convergence of the iteration 
algorithm to the test solution with a principled placement of sensors. 


Proxy Measures to the Condition Number 


We end this section by considering alternative measures to the condition
number $\kappa(M)$. The computation of the condition number itself can be
computationally expensive. Moreover, until r sensors are chosen in an r-POD-mode
expansion, the condition number computation is itself numerically unstable.
However, it is clear what the condition-number-minimization algorithm is trying to
achieve: make the measurement matrix M as near to the identity as possible.


Figure 13.10: Convergence of the reconstruction to the test function (13.11). The 
left panel shows iterations 1-20 and the significant reconstruction errors of the 
early iterations and limited number of sensors. Indeed, for the first nine iter- 
ations, the condition number and least-squares error are quite large since the 
system is not full rank. The right panel shows a zoom-in of the solution from 
iteration 9 to 20 where the convergence is clearly observed. Comparison in both 
panels can be made to the test function. 


This suggests the following alternative algorithm, which was also developed
by Willcox [754].


1. Place sensor k at each spatial location possible and evaluate the
difference in the sum of the diagonal entries of the matrix M minus the sum
of the off-diagonal components; call this $\kappa_2(M)$. Only points not already
containing a sensor are considered.


2. Determine the spatial location that generates the maximum value of the 
above quantity. This spatial location is now the kth sensor location. 


3. Add sensor k + 1 and repeat the previous two steps. 


Modification of two lines of code can enact a new metric which circumvents 
the computation of the condition number. 
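A sketch of the proxy-metric variant is given below. It assumes, following the wording of step 1, that the off-diagonal entries are summed with their signs (an absolute-value version is a possible alternative); the function names `kappa2` and `greedy_proxy_sensors` and the analytic `hermite_modes` helper are hypothetical, and indices are zero-based.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_modes(x, r):
    """Return a (len(x), r) matrix whose columns are Gauss-Hermite functions."""
    dx = x[1] - x[0]
    Psi = np.zeros((x.size, r))
    for n in range(r):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0
        norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
        Psi[:, n] = norm * hermval(x, coeffs) * np.exp(-x**2 / 2)
    return Psi * np.sqrt(dx)

def kappa2(M):
    """Sum of diagonal entries minus sum of off-diagonal entries of M."""
    return 2.0 * np.trace(M) - M.sum()

def greedy_proxy_sensors(Psi, n_sensors):
    """Place sensors one at a time, each maximizing kappa2(M) over free locations."""
    chosen, history = [], []
    remaining = np.arange(Psi.shape[0])
    for _ in range(n_sensors):
        scores = np.empty(remaining.size)
        for i, j in enumerate(remaining):
            rows = Psi[chosen + [j]]
            scores[i] = kappa2(rows.T @ rows)
        best = int(remaining[np.argmax(scores)])     # maximize instead of minimize
        chosen.append(best)
        history.append(float(scores.max()))
        remaining = np.setdiff1d(remaining, [best])
    return chosen, history

x = np.linspace(-4, 4, 81)
Psi = hermite_modes(x, 10)
sensors, scores = greedy_proxy_sensors(Psi, 60)
print("first four sensor indices (zero-based):", sensors[:4])
```

Relative to a condition-number search, only the metric and the choice of `argmax` over `argmin` change, and no singular value decomposition is needed per candidate.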

To evaluate this new gappy sensor location algorithm, we track the new
proxy metric we are trying to maximize as a function of the number of
iterations along with the least-squares error of our test function (13.11). In
this case, up to 60 sensors are considered, since the convergence is slower than
before. Figure 13.11 shows the progress of the algorithm as it evaluates the
sensor locations for up to 60 sensors. By construction, the algorithm maximizes
the sum of the diagonals minus the sum of the off-diagonals at each step of the
iteration; thus, as sensors are added, this measure steadily increases (top left
panel of Fig. 13.11). The least-squares error for the reconstruction of the test
function decreases, but not monotonically. Further, the convergence is very slow.

Figure 13.11: Sum of diagonals minus off-diagonals (top left) and least-squares
error (logarithm) as a function of the number of iterations in the second
gappy sensor placement algorithm. The new proxy metric for condition number
monotonically increases, since this is being maximized at each iteration
step. The log of the least-squares error in the reconstruction of the test
function (13.11) shows a trend towards improvement as the number of sensors is
increased, but convergence is extremely slow in comparison to minimizing the
condition number. The right panel shows the sensors as they turn on (black
squares) over the first 60 iterations. The first measurement location is, for
instance, at $x_{37}$.


At least for this example, the method does not work as well as the condition 
number metric. However, it can improve performance in certain cases [754], 
and it is much more computationally efficient to compute. 

As before, we also consider the placement of the sensors as a function of
iteration in the right panel of Fig. 13.11. Specifically, we depict the turning on
process of the sensors. The first sensor location is $x_{37}$ followed by
$x_{33}$, $x_{36}$, and 23), respectively. The process is continued until the first
60 sensors are turned on. The pattern of sensors depicted is significantly
different than in the condition-number-minimization algorithm. Indeed, this
algorithm, with these modes, turns on sensors in local clusters without sampling
uniformly from the domain.


13.4 Gappy Measurements: Maximal Variance 


The previous section developed principled ways to determine the location of 
sensors for gappy POD measurements. This was a significant improvement 
over simply choosing sensor locations randomly. Indeed, the minimization of 
the condition number through location selection performed quite well, quickly
improving accuracy and least-squares reconstruction error. The drawback to
the proposed method was two-fold. First, the algorithm itself is expensive to
implement, requiring a computation of the condition number for every sensor
location selected under an exhaustive search. Second, the algorithm was
ill-conditioned until the rth sensor was chosen in an r-POD-mode expansion.
Thus the condition number was theoretically infinite, but on the order of
$10^{17}$ for computational purposes.

Karniadakis and co-workers proposed an alternative to the Willcox 
algorithm to overcome the computational issues outlined. Specifically, in- 
stead of placing one sensor at a time, the new algorithm places r sensors, for an 
r-POD-mode expansion, at the first step of the iteration. Thus the matrix gener- 
ated is no longer ill-conditioned with a theoretically infinite condition number. 

The algorithm by Karniadakis further proposes a principled way to select 
the original r sensor locations. This method selects locations that are extrema 
points of the POD modes, which are designed to maximally capture variance 
in the data. Specifically, the following algorithm is suggested: 


1. Place r sensors initially. 


2. Determine the spatial locations of these first r sensors by considering the
maximum of each of the POD modes $\psi_k$.


3. Add additional sensors at the next largest extrema of the POD modes. 


The performance of this algorithm is not strong for only r measurements, but
it at least produces stable condition number calculations. To improve
performance, one could also use the minimum of each of the modes $\psi_k$. Thus the
maximal value and minimal value of variance are considered. For the harmonic
oscillator example, the first mode produces no minimum, as the minima are at
$x = \pm\infty$.

More generally, the Karniadakis algorithm advocates randomly select- 
ing p sensors from M potential extrema, and then modifying the search posi- 
tions with the goal of improving the condition number. In this case, one must 
identify all the maxima and minima of the POD modes in order to make the 
selection. The harmonic oscillator modes and their maxima and minima are
illustrated in Fig. 13.12.

In this example, there are 55 possible extrema. This computation assumes 
the data is sufficiently smooth so that extrema are simply found by considering 
neighboring points, i.e., a maximum exists if its two neighbors have a lower 
value, whereas a minimum exists if its neighbors have a higher value. 
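The neighbor-comparison test for extrema can be sketched as follows. The helper names `hermite_modes` and `local_extrema` are hypothetical, the modes are built from the analytic Gauss-Hermite formula as an assumption, and grid endpoints are excluded because they have only one neighbor.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_modes(x, r):
    """Return a (len(x), r) matrix whose columns are Gauss-Hermite functions."""
    dx = x[1] - x[0]
    Psi = np.zeros((x.size, r))
    for n in range(r):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0
        norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
        Psi[:, n] = norm * hermval(x, coeffs) * np.exp(-x**2 / 2)
    return Psi * np.sqrt(dx)

def local_extrema(v):
    """Indices of interior maxima and minima found by comparing both neighbors."""
    mid = v[1:-1]
    is_max = (mid > v[:-2]) & (mid > v[2:])
    is_min = (mid < v[:-2]) & (mid < v[2:])
    return np.where(is_max)[0] + 1, np.where(is_min)[0] + 1

x = np.linspace(-4, 4, 81)
Psi = hermite_modes(x, 10)

all_extrema = []
for n in range(Psi.shape[1]):
    maxima, minima = local_extrema(Psi[:, n])
    all_extrema.append(np.concatenate([maxima, minima]))

n_extrema = sum(e.size for e in all_extrema)       # extrema counted mode by mode
candidates = np.unique(np.concatenate(all_extrema))  # distinct grid locations
print("extrema counted per mode:", n_extrema)
print("distinct candidate locations:", candidates.size)
```

The per-mode count can be compared with the figure of 55 quoted in the text; the number of distinct grid locations is smaller whenever extrema of different modes fall on the same grid point.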

The maximal-variance algorithm suggests trying different configurations of 
the sensors at the extrema points. In particular, if 20 gappy measurements are 
desired, then we would need to search through various configurations of the 55

Figure 13.12: The top panel shows the mode structures of the Gauss-Hermite
polynomials $\Psi$ in the low-rank approximation of a POD expansion. The
discretization interval is $x \in [-4, 4]$ with a spacing of $\Delta x = 0.1$. The
color map shows the maximum (white) and minimum (black) that occur in the mode
structures. The bottom panel shows the grid cells corresponding to maxima
and minima (extrema) of POD mode variance. The extrema are candidates for
sensor locations, or the measurement matrix P, since they represent maximal
variance locations. Typically, one would take a random subsample of these
extrema to begin the evaluation of the gappy placement.


locations using 20 sensors. This combinatorial search is intractable. However, if
we simply attempt 100 random trials and select the best-performing
configuration, it is quite close to the performance of the
condition-number-minimization algorithm. A full execution of this algorithm, along
with a computation of the condition number and least-squares fit error with
(13.11), is generated. The condition number and least-squares error for the 100
trials are shown in Fig. 13.13. The configurations perform well compared with
random measurements, and some have excellent performance.
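The random-trial search over extrema locations might be sketched as below. All names are hypothetical, the modes are the analytic Gauss-Hermite functions (an assumption), and the candidate set is taken to be the distinct grid points at which any mode has an interior extremum, which is one possible reading of the procedure.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_modes(x, r):
    """Return a (len(x), r) matrix whose columns are Gauss-Hermite functions."""
    dx = x[1] - x[0]
    Psi = np.zeros((x.size, r))
    for n in range(r):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0
        norm = 1.0 / sqrt(2.0**n * factorial(n) * sqrt(pi))
        Psi[:, n] = norm * hermval(x, coeffs) * np.exp(-x**2 / 2)
    return Psi * np.sqrt(dx)

def extrema_locations(Psi):
    """Distinct interior grid indices where any mode has a local max or min."""
    mid, left, right = Psi[1:-1], Psi[:-2], Psi[2:]
    is_ext = ((mid > left) & (mid > right)) | ((mid < left) & (mid < right))
    return np.where(is_ext.any(axis=1))[0] + 1

x = np.linspace(-4, 4, 81)
Psi = hermite_modes(x, 10)
f = np.exp(-(x - 0.5) ** 2) + 3 * np.exp(-2 * (x + 1.5) ** 2)

candidates = extrema_locations(Psi)
n_pick = min(20, candidates.size)           # guard in case fewer candidates exist
rng = np.random.default_rng(1)              # assumed seed, for repeatability only

trials = []
for _ in range(100):
    sensors = rng.choice(candidates, size=n_pick, replace=False)
    Psi_s = Psi[sensors]
    M = Psi_s.T @ Psi_s
    a = np.linalg.lstsq(M, Psi_s.T @ f[sensors], rcond=None)[0]
    err = np.linalg.norm(f - Psi @ a)
    trials.append((np.linalg.cond(M), err, np.sort(sensors)))

best_kappa, best_err, best_sensors = min(trials, key=lambda t: t[0])
print("best cond(M):", best_kappa, " error:", best_err)
print("best sensor indices (zero-based):", best_sensors)
```

Each trial costs one condition number evaluation on a matrix that already has at least as many sensors as modes, so the ill-conditioned early stage of a one-sensor-at-a-time search is avoided.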

A direct comparison of all these methods is shown in Fig. 13.14.
Specifically, the figure shows the results from using (a) the maximum locations
of the POD modes, (b) the maximum and minimum locations of each POD 
mode, and (c) a random selection of 20 of the 55 extremum locations of the 
POD modes. These are compared against (d) the best five sensor placement 
locations of 20 sensors selected from the extremum over 100 random trials, 
and (e) the condition-number-minimization algorithm (in red). The maximal- 
variance algorithm performs approximately as well as the condition-number- 
minimization algorithm. However, the algorithm is faster and never computes 
condition numbers on ill-conditioned matrices. Karniadakis and co-workers 
also suggest innovations on this basic implementation. Specifically, it is

Figure 13.13: Condition number and least-squares error to test function (13.11)
over 100 random trials that draw 20 sensor locations from the possible 55
extrema depicted in Fig. 13.12. The 100 trials produce a number of sensor
configurations that perform close to the level of the condition-number-minimization
algorithm of the last section. However, the computational costs in generating
such trials can be significantly lower.


suggested that one consider each sensor, one-by-one, and try placing it in all 
other available spatial locations. If the condition number is reduced, the sensor 
is moved to that new location and the next sensor is considered. 


13.5 POD and the Discrete Empirical Interpolation 
Method (DEIM) 


The POD method illustrated thus far aims to exploit the underlying low-dimensional 
dynamics observed in many high-dimensional computations. POD is often used 
for reduced-order models (ROMs), which are of growing importance in scien- 
tific applications and computing. ROMs reduce the computational complexity 
and time needed to solve large-scale, complex systems [24, 75, 325, 578].
Specifically, ROMs provide a principled approach to approximating high-dimensional
spatio-temporal systems [185], typically generated from numerical discretiza- 
Figure 13.14: Performance metrics for placing sensors based upon the extrema
of the variance of the POD modes. Both the least-squares error for the recon- 
struction of the test function (13.11) and the condition number are considered. 
Illustrated are the results from using (a) the maximum locations of the POD 
modes, (b) the maximum and minimum locations of each POD mode, and 
(c) a random selection of 20 of the 55 extremum locations of the POD modes. 
These are compared against (d) the five top selections of 20 sensors from the 
100 random trials, and (e) the condition-number-minimization algorithm (red 
bar). The random placement of sensors from the extremum locations provides 
performance close to that of the condition number minimization without the 
same high computational costs. 


tion, by low-dimensional subspaces that produce nearly identical input/out- 
put characteristics of the underlying nonlinear dynamical system. However, 
despite the significant reduction in dimensionality with a POD basis, the
complexity of evaluating higher-order nonlinear terms may remain as challenging
as the original problem [55, 171]. The empirical interpolation method (EIM) and
the simplified discrete empirical interpolation method (DEIM) for the proper 
orthogonal decomposition (POD) overcome this difficulty by provid- 
ing a computationally efficient method for discretely (sparsely) sampling and 
evaluating the nonlinearity. These methods ensure that the computational
complexity of ROMs scales favorably with the rank of the approximation, even with
complex nonlinearities.

EIM has been developed for the purpose of efficiently managing the com- 
putation of the nonlinearity in dimensionality reduction schemes, with DEIM 
specifically tailored to POD with Galerkin projection. Indeed, DEIM approxi- 
mates the nonlinearity by using a small, discrete sampling of points that are 
determined in an algorithmic way. This ensures that the computational cost of 
evaluating the nonlinearity scales with the rank of the reduced POD basis. As 
an example, consider the case of an r-mode Galerkin-POD truncation. A simple
cubic nonlinearity requires that the Galerkin-POD approximation be cubed,
resulting in $r^3$ operations to evaluate the nonlinear term. DEIM approximates
the cubic nonlinearity by using $O(r)$ discrete sample points of the nonlinearity,
thus preserving a low-dimensional ($O(r)$) computation, as desired. The DEIM
approach combines projection with interpolation. Specifically, DEIM uses
selected interpolation indices to specify an interpolation-based projection for a
nearly $\ell_2$ optimal subspace approximating the nonlinearity. EIM/DEIM are not
the only methods developed to reduce the complexity of evaluating nonlinear 
terms; see for instance the missing point estimation (MPE) or gappy 
POD methods. However, they have been successful in a 
large number of diverse applications and models [171]. In any case, the MPE, 
gappy POD, and EIM/DEIM use a small selected set of spatial grid points to 
avoid evaluation of the expensive inner products required to evaluate nonlin- 
ear terms. 


POD and DEIM 


Consider a high-dimensional system of nonlinear differential equations that 
can arise, for example, from the finite-difference discretization of a partial dif- 
ferential equation. In addition to constructing a snapshot matrix of the 
solution of the PDE so that POD modes can be extracted, the DEIM algorithm 
also constructs a snapshot matrix of the nonlinear term of the PDE: 


$$ N = [N_1 \;\; N_2 \;\; \cdots \;\; N_m], \tag{13.12} $$


where the columns $N_k \in \mathbb{C}^n$ are evaluations of the nonlinearity at time $t_k$.

To achieve high-accuracy solutions, n is typically very large, making the 
computation of the solution expensive and/or intractable. The Galerkin-POD 
method is a principled dimensionality reduction scheme that approximates the 
function u(t) with rank-r optimal basis functions, where $r \ll n$. As shown in
the previous chapter, these optimal basis functions are computed from a sin- 
gular value decomposition of a series of temporal snapshots of the complex 
system.

The standard POD procedure is a ubiquitous algorithm in the reduced-order
modeling community. However, it also helps illustrate the need for
innovations such as DEIM, gappy POD, and/or MPE. Consider the nonlinear
component of the low-dimensional evolution (12.21): $\Psi^T N(\Psi a(t))$. For a
simple nonlinearity such as $N(u(x, t)) = u(x, t)^3$, consider its impact on a
spatially discretized, two-mode POD expansion:
$u(x, t) = a_1(t)\psi_1(x) + a_2(t)\psi_2(x)$. The
algorithm for computing the nonlinearity requires the evaluation:


$$ u(x,t)^3 = a_1^3\psi_1^3 + 3a_1^2 a_2\psi_1^2\psi_2 + 3a_1 a_2^2\psi_1\psi_2^2 + a_2^3\psi_2^3. \tag{13.13} $$


The dynamics of $a_1(t)$ and $a_2(t)$ would then be computed by projecting onto the
low-dimensional basis by taking the inner product of this nonlinear term with
respect to both $\psi_1$ and $\psi_2$. Thus not only does the number of computations
double, but also the inner products must be computed with the $n$-dimensional
vectors. Methods such as DEIM overcome this high-dimensional computation.
Figure 13.15 gives an overview of the algorithm that is detailed below.


DEIM 


As outlined in the previous section, the shortcomings of the Galerkin-POD
method are generally due to the evaluation of the nonlinear term $N(\Psi a(t))$.
To avoid this difficulty, DEIM approximates $N(\Psi a(t))$ through projection and
interpolation instead of evaluating it directly. Specifically, a low-rank
representation of the nonlinearity is computed from the singular value decomposition,


$$N = \Xi \Sigma_N V_N^*, \tag{13.14}$$


where the matrix $\Xi$ contains the optimal basis for spanning the nonlinearity.
Specifically, we consider the rank-$p$ basis


$$\Xi_p = [\,\xi_1 \;\; \xi_2 \;\; \cdots \;\; \xi_p\,] \tag{13.15}$$


that approximates the nonlinear function ($p \ll n$ and $p \sim r$). The approximation
to the nonlinearity $N$ is given by


$$N \approx \Xi_p c(t), \tag{13.16}$$


where $c(t)$ is similar to $a(t)$. Since this is a highly overdetermined
system, a suitable vector $c(t)$ can be found by selecting $p$ rows of the system.
The DEIM algorithm was developed to identify which $p$ rows to evaluate.

The DEIM algorithm begins by considering the vectors $e_{\gamma_j} \in \mathbb{R}^n$, which are
the $\gamma_j$th columns of the $n$-dimensional identity matrix. We can then construct
the projection matrix $P = [\,e_{\gamma_1} \;\; e_{\gamma_2} \;\; \cdots \;\; e_{\gamma_p}\,]$, which is chosen so that
$P^T \Xi_p$ is non-singular. Then $c(t)$ is uniquely defined from
$P^T N = P^T \Xi_p c(t)$, and thus


$$N \approx \Xi_p (P^T \Xi_p)^{-1} P^T N. \tag{13.17}$$


Figure 13.15: Demonstration of the first three iterations of the DEIM algorithm.
For illustration only, the nonlinearity matrix $N = \Xi \Sigma_N V_N^*$ is assumed to be
composed of harmonic oscillator modes with the first 10 modes comprising $\Xi_p$.
The initial measurement location is chosen at the maximum of the first mode
$\xi_1$. Afterwards, there is a three-step process for selecting subsequent
measurement locations based upon the location of the maximum of the residual vector
$R_j$. The first (red), second (green), and third (blue) measurement locations are
shown along with the construction of the sampling matrix $P$.


The tremendous advantage of this result for nonlinear model reduction is that
the term $P^T N$ requires evaluation of the nonlinearity only at $p \ll n$ indices.
DEIM further proposes a principled method for choosing the basis vectors $\xi_j$
and indices $\gamma_j$. The DEIM algorithm, which is based on a greedy search, is
detailed and further demonstrated in Table 13.1.


Table 13.1: DEIM algorithm for finding approximation basis for the nonlinearity
and its interpolation indices. The algorithm first constructs the nonlinear
basis modes and initializes the first measurement location, and the matrix $P_1$,
as the maximum of $\xi_1$. The algorithm then successively constructs columns of
$P_j$ by considering the location of the maximum of the residual $R_j$.


DEIM algorithm

Basis construction and initialization
  collect data, construct snapshot matrix      $X = [\,u(t_1) \;\; u(t_2) \;\; \cdots \;\; u(t_m)\,]$
  construct nonlinear snapshot matrix          $N = [\,N(u(t_1)) \;\; N(u(t_2)) \;\; \cdots \;\; N(u(t_m))\,]$
  singular value decomposition of $N$          $N = \Xi \Sigma_N V_N^*$
  construct rank-$p$ approximating basis       $\Xi_p = [\,\xi_1 \;\; \xi_2 \;\; \cdots \;\; \xi_p\,]$
  choose the first index (initialization)      $[\rho, \gamma_1] = \max |\xi_1|$
  construct first measurement matrix           $P_1 = [\,e_{\gamma_1}\,]$

Interpolation indices and iteration loop ($j = 2, 3, \ldots, p$)
  calculate $c_j$                              $P_j^T \Xi_j c_j = P_j^T \xi_{j+1}$
  compute residual                             $R_{j+1} = \xi_{j+1} - \Xi_j c_j$
  find index of maximum residual               $[\rho, \gamma_j] = \max |R_{j+1}|$
  add new column to measurement matrix         $P_{j+1} = [\,P_j \;\; e_{\gamma_j}\,]$

POD and DEIM provide a number of advantages for nonlinear model reduction
of complex systems. POD provides a principled way to construct an
$r$-dimensional subspace $\Psi$ characterizing the dynamics. DEIM augments POD
by providing a method to evaluate the problematic nonlinear terms using a
$p$-dimensional subspace $\Xi_p$ that represents the nonlinearity. Thus a small number
of points can be sampled to approximate the nonlinear terms in the ROM.
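
The greedy selection of Table 13.1 and the interpolation formula (13.17) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the listing used later in the chapter: the function names `deim_indices` and `deim_approx` and the synthetic polynomial data are assumptions introduced here, and the loop is written in the common form in which each basis column is interpolated by the columns that precede it.

```python
import numpy as np

def deim_indices(Xi):
    """Greedy DEIM index selection for the columns of the n x p basis Xi."""
    n, p = Xi.shape
    idx = [int(np.argmax(np.abs(Xi[:, 0])))]      # maximum of the first mode
    for j in range(1, p):
        # interpolate column j at the current points using columns 0..j-1
        c = np.linalg.solve(Xi[idx, :j], Xi[idx, j])
        res = Xi[:, j] - Xi[:, :j] @ c            # residual vanishes at chosen points
        idx.append(int(np.argmax(np.abs(res))))
    return np.array(idx)

def deim_approx(Xi, idx, f_at_idx):
    """Evaluate Xi (P^T Xi)^{-1} P^T f from the p sampled entries of f."""
    return Xi @ np.linalg.solve(Xi[idx, :], f_at_idx)

# Hypothetical data: nonlinear snapshots that lie in a 3-dimensional space
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
modes = np.stack([np.ones_like(x), x, x**2], axis=1)
N_snap = modes @ rng.standard_normal((3, 40))     # nonlinear snapshot matrix
Xi = np.linalg.svd(N_snap, full_matrices=False)[0][:, :3]

idx = deim_indices(Xi)
f = modes @ np.array([1.0, -2.0, 0.5])            # new vector in the same span
f_deim = deim_approx(Xi, idx, f[idx])             # uses only 3 entries of f
```

Because `f` lies in the span of the basis, the approximation built from three sampled entries should agree with `f` up to round-off. For vectors outside the span, the error depends on how well conditioned $P^T \Xi_p$ is.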


13.6 DEIM Algorithm Implementation 


To demonstrate model reduction with DEIM, we again consider the NLS equa- 
tion (12.29). Specifically, the data set considered is a matrix whose rows repre- 
sent the time snapshots and whose columns represent the spatial discretization 
points. As in the first section of this chapter, our first step is to transpose this 
data so that the time snapshots are columns instead of rows. The following 
code transposes the data and also performs a singular value decomposition to 
get the POD modes. 


Code 13.1: [MATLAB] Dimensionality reduction for NLS.


X=usol.'; % data matrix X
[U,S,W]=svd(X,0); % SVD reduction


Code 13.1: [Python] Dimensionality reduction for NLS.


import numpy as np

X = usol.T # data matrix X
U,S,WT = np.linalg.svd(X,full_matrices=0) # SVD reduction


In addition to the standard POD modes, the singular value decomposition
of the nonlinear term is also required for the DEIM algorithm. This computes
the low-rank representation of $N(u) = |u|^2 u$ directly as $N = \Xi \Sigma_N V_N^*$.


Code 13.2: [MATLAB] Dimensionality reduction for nonlinearity of NLS.


NL=i*(abs(X).^2).*X;
[XI,S_NL,W]=svd(NL,0);


Code 13.2: [Python] Dimensionality reduction for nonlinearity of NLS.


NL = (1j)*np.power(np.abs(X),2)*X
XI,S_NL,WT = np.linalg.svd(NL,full_matrices=0)


Once the low-rank structures are computed, the rank of the system is chosen
with the parameter $r$. In what follows, we choose $r = p = 3$ so that both the
standard POD modes and nonlinear modes, $\Psi$ and $\Xi_p$, have three columns
each. The following code selects the POD modes for $\Psi$ and projects the initial
condition onto the POD subspace.


Code 13.3: [MATLAB] Rank selection and POD modes.


r=3; % select rank truncation
Psi=U(:,1:r); % select POD modes
a=Psi'*u0; % project initial conditions


Code 13.3: [Python] Rank selection and POD modes.


r = 3 # select rank truncation
Psi = U[:,:r] # select POD modes
a0 = Psi.T @ u0 # project initial conditions


We now build the interpolation matrix $P$ by executing the DEIM algorithm
outlined in the last section. The algorithm starts by selecting the first
interpolation point from the maximum of the most dominant mode of $\Xi_p$.


Code 13.4: [MATLAB] First DEIM point.


[Xi_max,nmax]=max(abs(XI(:,1)));
XI_m=XI(:,1);
z=zeros(n,1);
P=z; P(nmax)=1;


Code 13.4: [Python] First DEIM point.


nmax = np.argmax(np.abs(XI[:,0]))
XI_m = XI[:,0].reshape(n,1)
z = np.zeros((n,1))
P = np.copy(z)
P[nmax] = 1


The algorithm iteratively builds $P$ one column at a time. The next step of
the algorithm is to compute the second through $r$th interpolation points via the
greedy DEIM algorithm. Specifically, the vector $c_j$ is computed from
$P_j^T \Xi_j c_j = P_j^T \xi_{j+1}$, where $\xi_j$ are the columns of the nonlinear POD
modes matrix $\Xi_p$. The actual interpolation point comes from looking for the
maximum of the residual $R_{j+1} = \xi_{j+1} - \Xi_j c_j$. Each iteration of the
algorithm produces another column of the sparse interpolation matrix $P$. The
integers nmax give the location of the interpolation points.


Code 13.5: [MATLAB] DEIM points 2 through r.
for j=2:r
    c=(P'*XI_m)\(P'*XI(:,j));
    res=XI(:,j)-XI_m*c;
    [Xi_max,nmax]=max(abs(res));
    XI_m=[XI_m,XI(:,j)];
    P=[P,z]; P(nmax,j)=1;
end


Code 13.5: [Python] DEIM points 2 through r.


for jj in range(1,r):
    c = np.linalg.solve(P.T @ XI_m, P.T @ XI[:,jj].reshape(n,1))
    res = XI[:,jj].reshape(n,1) - XI_m @ c
    nmax = np.argmax(np.abs(res))
    XI_m = np.concatenate((XI_m,XI[:,jj].reshape(n,1)),axis=1)
    P = np.concatenate((P,z),axis=1)
    P[nmax,jj] = 1


With the interpolation matrix, we are ready to construct the ROM. The first
part is to construct the linear term $\Psi^T L \Psi$, where the linear operator
for NLS is the Laplacian. The derivatives are computed using the Fourier
transform.
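
The Fourier-based second derivative can be checked in isolation before it is applied to the POD modes. The sketch below assumes a periodic domain and builds the wavenumber vector with `np.fft.fftfreq`; the names `Lx`, `n`, and `second_derivative`, and the test function $\sin(3x)$, are illustrative choices and not part of the chapter's data.

```python
import numpy as np

# Assumed setup: periodic domain [-Lx/2, Lx/2) with n equispaced points
Lx, n = 2 * np.pi, 64
x = np.linspace(-Lx / 2, Lx / 2, n, endpoint=False)
# Angular wavenumbers in the ordering used by np.fft.fft
k = 2 * np.pi * np.fft.fftfreq(n, d=Lx / n)

def second_derivative(u, k):
    """Spectral u_xx: multiply each Fourier coefficient by (ik)^2 = -k^2."""
    return np.fft.ifft(-(k**2) * np.fft.fft(u))

u = np.sin(3 * x)
uxx = second_derivative(u, k)   # compare against the exact value -9 sin(3x)
```

Applying the same function to each column of the mode matrix gives the array that is then projected onto the POD basis to form the reduced linear operator.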


Code 13.6: [MATLAB] Projection of linear terms.
for j=1:r % linear derivative terms
    Lxx(:,j)=ifft(-k.^2.*fft(Psi(:,j)));
end
L=(i/2)*(Psi')*Lxx; % projected linear term


Code 13.6: [Python] Projection of linear terms.


Lxx = np.zeros((n,r),dtype=complex)
for jj in range(r):
    # loop body mirrors the MATLAB listing: spectral second derivative of each mode
    Lxx[:,jj] = np.fft.ifft(-np.power(k,2)*np.fft.fft(Psi[:,jj]))
L = 0.5 * (1j) * Psi.T @ Lxx # projected linear term


The projection of the nonlinearity is accomplished using the interpolation
matrix $P$ with the formula (13.17). Recall that the nonlinear term in (12.21) is
multiplied by $\Psi^T$. Also computed is the interpolated version of the low-rank
subspace spanned by $\Psi$.

Code 13.7: [MATLAB] Projection of nonlinear terms.


P_NL=Psi'*( XI_m*inv(P'*XI_m) ); % nonlinear projection
P_Psi=P'*Psi; % interpolation of Psi


Code 13.7: [Python] Projection of nonlinear terms.


P_NL = Psi.T @ ( XI_m @ np.linalg.inv(P.T @ XI_m) ) # nonlinear projection
P_Psi = P.T @ Psi # interpolation of Psi


It only remains now to advance the solution in time using a numerical time- 
stepper. This is done with a fourth-order Runge-Kutta routine. 


Code 13.8: [MATLAB] Time-stepping of ROM.


[tt,a]=ode45('rom_deim_rhs',t,a,[],P_NL,P_Psi,L);
Xtilde=Psi*a'; % DEIM approximation
waterfall(x,t,abs(Xtilde')), shading interp, colormap gray


Code 13.8: [Python] Time-stepping of ROM.


from scipy import integrate

a0_split = np.concatenate((np.real(a0),np.imag(a0))) # separate real/imaginary pieces
a_split = integrate.odeint(rom_deim_rhs,a0_split,t,mxstep=10**6)
a = a_split[:,:r] + (1j)*a_split[:,r:]
Xtilde = Psi @ a.T # DEIM approximation
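
Since `scipy.integrate.odeint` works with real-valued state vectors, the Python listing stacks real and imaginary parts before integrating and recombines them afterwards. The following sketch checks that bookkeeping on a small linear system with a known solution. The helper names `split` and `merge` and the diagonal operator `L` are hypothetical and chosen only for this test.

```python
import numpy as np
from scipy import integrate

def split(z):
    """Stack real and imaginary parts into one real vector."""
    return np.concatenate((np.real(z), np.imag(z)))

def merge(y, r):
    """Recombine a stacked real array into r complex entries."""
    return y[..., :r] + 1j * y[..., r:]

# Hypothetical diagonal operator: da/dt = L a has solution exp(L t) a(0)
r = 2
L = np.diag([1j, -0.5 + 2j])

def rhs(y, t):
    a = merge(y, r)
    return split(L @ a)

a0 = np.array([1.0 + 0j, 0.5 - 0.5j])
t = np.linspace(0, 2, 21)
y = integrate.odeint(rhs, split(a0), t, rtol=1e-10, atol=1e-12)
a = merge(y, r)                                  # shape (len(t), r)
exact = np.exp(np.outer(t, np.diag(L))) * a0     # entrywise exponential solution
```

The same pattern carries over to the ROM: only the body of `rhs` changes, while the splitting and merging steps stay the same.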


The right-hand side of the time-stepper is now completely low-dimensional. 


Code 13.9: [MATLAB] Right-hand side of ROM.
function rhs=rom_deim_rhs(tspan,a,dummy,P_NL,P_Psi,L)
N=P_Psi*a;
rhs=L*a + i*P_NL*((abs(N).^2).*N);


Code 13.9: [Python] Right-hand side of ROM.
def rom_deim_rhs(a_split,tspan,P_NL=P_NL,P_Psi=P_Psi,L=L):
    a = a_split[:r] + (1j)*a_split[r:]
    N = P_Psi @ a
    rhs = L @ a + (1j) * P_NL @ (np.power(np.abs(N),2)*N)
    rhs_split = np.concatenate((np.real(rhs),np.imag(rhs)))
    return rhs_split


Figure 13.16: Comparison of the (a) full simulation dynamics and (b) rank r = 3
ROM using the three DEIM interpolation points. (c) A detail of the three POD
modes used for simulation is shown along with the first, second, and third
DEIM interpolation point locations. These three interpolation points are
capable of accurately reproducing the evolution dynamics of the full PDE system.


A comparison of the full simulation dynamics and rank r = 3 ROM using
the three DEIM interpolation points is shown in Fig. 13.16. Additionally, the
location of the DEIM points relative to the POD modes is shown. Aside from
the first DEIM point, the other locations are not on the minima or maxima of
the POD modes. Rather, the algorithm places them to maximize the residual.


QDEIM Algorithm 


Although DEIM is an efficient greedy algorithm for selecting interpolation points,
there are other techniques that are equally efficient. The recently proposed QDEIM
algorithm leverages the QR decomposition to provide efficient, greedy
interpolation locations. This has been shown to be a robust mathematical
architecture for sensor placement in many applications [481]. See Section 3.8 for
a more general discussion. The QR decomposition can also provide a greedy
strategy to identify interpolation points. In QDEIM, the QR pivot locations are
the sensor locations. The following code can replace the DEIM algorithm to
produce the interpolation matrix P.


Code 13.10: [MATLAB] QR-based interpolation points.


[Q,R,pivot]=qr(NL.');
P=pivot(:,1:3);


Code 13.10: [Python] QR-based interpolation points.

from scipy.linalg import qr

Q,R,pivot = qr(NL.T,pivoting=True)
P_qr = np.zeros_like(x)
P_qr[pivot[:3]] = 1


Using this interpolation matrix gives interpolation locations identical to those shown
in Fig. 13.16. More generally, there are estimates that show that QDEIM may
improve error performance over standard DEIM [215]. The ease of use of the
QR algorithm makes this an attractive method for sparse interpolation.
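
As a complement to the listing above, the sketch below applies column-pivoted QR to the transposed low-rank basis rather than to the full nonlinear snapshot matrix, which is one common way to state QDEIM and is an assumption made here. It also builds the selection matrix $P$ explicitly so that the conditioning of $P^T \Xi_p$ can be inspected. The function names and the random orthonormal basis are hypothetical.

```python
import numpy as np
from scipy.linalg import qr

def qdeim_indices(Xi):
    """Interpolation indices from column-pivoted QR of the transposed basis."""
    p = Xi.shape[1]
    _, _, pivot = qr(Xi.conj().T, pivoting=True)
    return pivot[:p]

def selection_matrix(idx, n):
    """n x p matrix P whose jth column is the idx[j]-th unit vector."""
    P = np.zeros((n, len(idx)))
    P[idx, np.arange(len(idx))] = 1.0
    return P

# Hypothetical basis: three orthonormal columns built from random data
rng = np.random.default_rng(1)
Xi = np.linalg.qr(rng.standard_normal((100, 3)))[0]

idx = qdeim_indices(Xi)
P = selection_matrix(idx, Xi.shape[0])
cond = np.linalg.cond(P.T @ Xi)   # smaller values indicate better-conditioned interpolation
```

In practice one rarely forms `P`; indexing rows with `idx` gives the same result at lower cost, and `P` is shown only to connect with the notation of (13.17).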


13.7 Decoder Networks for Interpolation 


The gappy interpolation methods presented thus far are all based upon lin- 
ear mappings between the measurement space and the full-state reconstruc- 
tion. Equation provides a mathematical representation of this mapping, 
which dictates how measurements in an r-dimensional (low-rank) space can be 
related to the original n-dimensional (high-dimensional) state space. The focus 
thus far is in leveraging SVD modes $\Psi$ for reconstruction tasks. Specifically, if
the original state space is represented in the low-dimensional subspace so that
$u = \Psi a$, then the suite of gappy interpolation methods can be executed in order
to approximate the high-fidelity solution.

To be more precise, recall that, in the gappy POD formulation, the measurement
matrix specifies the interpolation to be used:


$$\tilde{u} = P u \approx P \Psi a, \tag{13.18}$$


where the state vector is expressed in terms of POD modes in the second
approximation. Given measurements $\tilde{u}$ along with a measurement matrix $P$ and
POD modes $\Psi$, the coefficients for reconstruction can now be computed by
least-squares:

$$a = (P\Psi)^\dagger \tilde{u}, \tag{13.19}$$


where $\dagger$ is the Moore-Penrose pseudo-inverse. In terms of an optimization
problem, this is alternatively formulated as


$$a \in \arg\min_{\tilde{a}} \|\tilde{u} - P\Psi\tilde{a}\|_2^2. \tag{13.20}$$

This is the standard POD reconstruction error formulation. It is revisited
here since the optimization formulation can be improved to stabilize POD
reconstructions. Specifically, just as with neural networks, additional regularizations
can be applied in order to ensure a more stable solution. The improved formulation
adds an elastic net [777] regularization, which is a combination of $\ell_1$ and
$\ell_2$ penalties:


$$a \in \arg\min_{\tilde{a}} \|\tilde{u} - P\Psi\tilde{a}\|_2^2 + \lambda_1\|\tilde{a}\|_1 + \lambda_2\|\tilde{a}\|_2^2, \tag{13.21}$$


where $\lambda_{1,2}$ are hyperparameters that control the $\ell_1$- and $\ell_2$-norm penalties, respectively.
In what follows, this is referred to as POD PLUS, since it is an augmentation of
standard POD. As will be shown, such a regularization improves the standard
linear mapping from measurements to the state space.
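
The two linear reconstructions can be compared in a small experiment. The sketch below computes the least-squares coefficients of (13.19) with a pseudo-inverse and solves a real-valued version of (13.21) by proximal gradient descent (ISTA). The solver choice, the function names, and the synthetic modes and sensors are all assumptions made for illustration and are not taken from the source.

```python
import numpy as np

def gappy_least_squares(A, u_meas):
    """Coefficients a = A^+ u_meas, with A = P Psi as in (13.19)."""
    return np.linalg.pinv(A) @ u_meas

def elastic_net(A, u_meas, lam1, lam2, n_iter=5000):
    """ISTA for min_a ||u_meas - A a||_2^2 + lam1 ||a||_1 + lam2 ||a||_2^2
    (real-valued data assumed)."""
    a = np.zeros(A.shape[1])
    # step = 1 / Lipschitz constant of the gradient of the smooth part
    step = 1.0 / (2.0 * (np.linalg.norm(A, 2) ** 2 + lam2))
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ a - u_meas) + 2.0 * lam2 * a
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)  # soft threshold
    return a

# Hypothetical example: 10 point sensors, 4 orthonormal modes on a 200-point grid
rng = np.random.default_rng(2)
Psi = np.linalg.qr(rng.standard_normal((200, 4)))[0]
sensors = rng.choice(200, size=10, replace=False)
A = Psi[sensors, :]                        # P Psi without forming P explicitly
a_true = np.array([2.0, -1.0, 0.0, 0.5])
u_meas = A @ a_true + 0.01 * rng.standard_normal(10)

a_ls = gappy_least_squares(A, u_meas)
a_en = elastic_net(A, u_meas, lam1=1e-3, lam2=1e-3)
u_hat = Psi @ a_en                         # full-state reconstruction
```

Setting both penalties to zero reduces the iteration to plain gradient descent on (13.20), so it should approach the pseudo-inverse solution. A sufficiently large `lam1` drives every coefficient to zero, which is a quick check that the thresholding step is active.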

Instead of linear mappings between measurements and state space, we leverage
the universal approximation properties of neural networks to construct
nonlinear mappings between measurements and state space. Specifically, we
construct a decoder neural network so that


$$\hat{u} = f_\theta(\tilde{u}), \tag{13.22}$$


where $\hat{u}$ is an approximation to the full state $u$ and $f_\theta(\cdot)$ is a decoder neural
network. The optimization procedure evaluates the expression


$$\arg\min_{\theta} \sum_{j=1}^{N} \|u_j - f_\theta(\tilde{u}_j)\|_2^2 \tag{13.23}$$


with $N$ training data pairs $\{\tilde{u}_j, u_j\}$ for $j = 1, 2, \ldots, N$. This supervised
algorithm uses the $N$ sample pairs from measurement $\tilde{u}_j$ to its corresponding full
state-space representation $u_j$ in order to build a nonlinear mapping between
them.

Figure 13.17 shows the architecture of the decoder mapping from measurements
to the state space. The only thing that needs to be determined is the number
of layers and their widths along with the activation functions. Erichson et
al. showed that a shallow decoder, in which only a few layers were used,
provides an effective nonlinear mapping while using only modest amounts
of training data. In addition, the optimization was modified to regularize the
weights so that


$$\arg\min_{\theta} \sum_{j=1}^{N} \|u_j - f_\theta(\tilde{u}_j)\|_2^2 + \lambda\|\theta\|_2^2, \tag{13.24}$$


where $\lambda$ is a hyperparameter that determines the strength of the $\ell_2$-norm
regularization. The Adam optimization algorithm was used to train the shallow
decoder. Various hyperparameters can be fine-tuned in practice, but the
choice of parameters used worked well for several physics-related
examples.


Figure 13.17: Decoder network providing a nonlinear map from measurements
to the high-dimensional state space: $\hat{u} = f_\theta(\tilde{u})$. The decoder trains on data
pairs $\{\tilde{u}_j, u_j\}$ for $j = 1, 2, \ldots, N$. The network is implemented in Python
using PyTorch; research code for flow behind the cylinder is available via
https://github.com/erichson/ShallowDecoder
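
To make the training objective (13.24) concrete without a deep learning framework, the sketch below fits a decoder with a single hidden layer in NumPy. It differs from the implementation cited above in several ways that are assumptions of this example: plain gradient descent replaces Adam, the data term is averaged over samples, only the weight matrices are penalized, and the travelling-sine data, layer width, and step size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training data: states on a 64-point grid, 5 point sensors
n, m, N = 64, 5, 200
x = np.linspace(0, 2 * np.pi, n)
phase = rng.uniform(0, 2 * np.pi, N)
U = np.sin(x[None, :] + phase[:, None])        # rows are full states u_j
sensors = np.linspace(0, n - 1, m).astype(int)
U_meas = U[:, sensors]                         # rows are measurements

# One hidden layer: f_theta(v) = tanh(v W1 + b1) W2 + b2
h = 16
W1 = 0.3 * rng.standard_normal((m, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal((h, n)); b2 = np.zeros(n)
lam, lr = 1e-4, 0.002

def loss(W1, b1, W2, b2):
    pred = np.tanh(U_meas @ W1 + b1) @ W2 + b2
    data = np.mean(np.sum((pred - U) ** 2, axis=1))   # averaged over samples
    return data + lam * (np.sum(W1**2) + np.sum(W2**2))

loss_start = loss(W1, b1, W2, b2)
for _ in range(3000):
    H = np.tanh(U_meas @ W1 + b1)              # hidden activations
    E = 2.0 * (H @ W2 + b2 - U) / N            # gradient of data term w.r.t. prediction
    gW2 = H.T @ E + 2 * lam * W2
    gb2 = E.sum(axis=0)
    dH = (E @ W2.T) * (1 - H**2)               # backpropagate through tanh
    gW1 = U_meas.T @ dH + 2 * lam * W1
    gb1 = dH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
loss_end = loss(W1, b1, W2, b2)
```

Comparing `loss_end` with `loss_start` shows whether the mapping from five sensor values to the 64-point state is being learned. A framework with automatic differentiation and an adaptive optimizer would be the natural choice for anything beyond a toy problem.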


Modal Comparison: POD versus Shallow Decoder 


ROMs exploit low-rank features of the data, which are often interpreted as 
modes that characterize physical processes. POD modes provide optimal 
representations in an $\ell_2$ sense. However, this does not guarantee that they are
the best modes in a broader sense when considering noisy and dynamic data. 
Indeed, POD modes can be easily corrupted by outliers and noise so that they 
are compromised in producing accurate reconstructions of the underlying high- 
dimensional data from which they are extracted. 

The shallow decoder network highlighted above also produces modal struc- 
tures. In contrast with POD modes, which can be linearly superimposed to pro- 
duce an approximation, the decoder network is a nonlinear transformation and 
linear superposition does not hold. Figure 13.18 shows the contrasting
dominant modal structures that are generated from the flow around a cylinder
example. Note that for POD modes, the dominant modes alternate between
symmetric and antisymmetric modes in the vertical direction. Linearly superimposing
these modes in time generates the canonical dynamics of von Kármán vortex 
shedding. In contrast, the shallow decoder modes are not symmetric. Rather, 
their shapes are very much like what is observed in the fluid flows, i.e., the 
modes look like snapshots of the fluid itself. The modes are not orthogonal, yet
they are used in the nonlinear shallow decoder to reconstruct the fluid dynamics.


Figure 13.18: Dominant modes learned by the shallow decoder in comparison
with dominant POD modes. The modal features show that the shallow decoder
network constructs a reasonable characterization of the flow behind a cylinder
using very different modal structures. Indeed, by not constraining the modes
to be linear and orthogonal, as is enforced with POD, a potentially more
interpretable feature space can be extracted from data. Such modes can be exploited
for reconstruction of the state space from limited measurements and limited
data. From Erichson et al. [235].


There is more than just a distinct difference in modal profiles between the 
POD and shallow decoder. The robustness of the linear versus nonlinear encod- 
ing strategies is remarkably different. To characterize the robustness and flexi- 
bility of the shallow decoder, we consider flow reconstruction in the presence 
of additive white noise. In practical experimental settings, noisy measurements 
are common and can have significant impact on building ROMs. Figure 13.19
shows the difference between the POD and POD PLUS methods in contrast
to the shallow decoder. The shallow decoder shows a clear advantage and a 
de-noising effect. Indeed, the reconstructed snapshots allow for a meaningful 
interpretation of the underlying structure while also being highly robust. Inter- 
estingly, POD PLUS also significantly outperforms the standard gappy meth- 
ods typically used in ROMs, while still maintaining linear superposition. Thus 
POD PLUS offers a hybrid method where performance is increased while
retaining the advantageous features of linearity. But, overall, the shallow decoder
shows that a neural network model $f_\theta(\cdot)$ can provide significant performance
gains.


Figure 13.19: Reconstruction results for the flow around the cylinder with noise.
For this simulation, the signal-to-noise ratio is 10. In (a) the target snapshot and
the corresponding sensor configuration (using 10 sensors) is shown. Neither POD
nor POD PLUS is able to reconstruct the flow field, as shown in (b) and
(c). The shallow decoder is able to reconstruct the coherent structure of the flow
field, as shown in (d). From Erichson et al. [235].


13.8 Randomization and Compression for ROMs 


This chapter has been largely concerned with the interpolation problem
associated with ROMs. Specifically, how does one construct a ROM without recourse
to the high-dimensional state? Gappy interpolation techniques aim to construct
ROMs and compute nonlinear terms in PDEs in an efficient manner. Specifically,
we recall that we are interested in building ROMs for (12.13). Assuming
a solution ansatz $u = \Psi a$ allows for the construction of POD and DMD ROMs
for the evolution dynamics of $a(t)$. Specifically, we have the following ROM
models:


$$\frac{da}{dt} = \Psi^T L \Psi a + \Psi^T N(\Psi a, \beta) \quad \text{(POD)}, \tag{13.25a}$$
$$\frac{da}{dt} = \Psi^T L \Psi a + \Psi^T \Phi \exp(\Omega t)\, b \quad \text{(POD-DMD)}. \tag{13.25b}$$


The computational bottleneck addressed in this chapter is the repeated
evaluation of the nonlinear term $\Psi^T N(\Psi a, \beta)$, which is done using gappy
methods (e.g., DEIM or QDEIM). The greedy algorithms outlined in the preceding
sections highlight how a small number of measurements can be used with the
low-rank modes $\Psi$ to accomplish this task of evaluating inner products
repeatedly.

Ignored overall in the ROM formulation is the offline cost of producing
low-rank embeddings, manifest here by $\Psi$ and $\Phi$. The high-fidelity simulation data
$X \in \mathbb{C}^{n \times m}$ used to produce ROMs is often exceptionally high-dimensional.
Thus the cost of producing a full or economy SVD is large when $n, m \gg 1$. This
often is not a problem if only a single SVD needs to be performed, but often the
ROM needs to be updated, and recourse to the original high-dimensional system
is required in order to update $\Psi$ and $\Phi$. To avoid costly re-computations,
randomized linear algebra and compressive sampling
techniques can be used for enhanced computational efficiency.

The main idea is to consider basis functions $\Psi$ not from the full set of
measurements but from a few spatially incoherent measurements. The measurement
matrix $C \in \mathbb{R}^{p \times n}$, which was originally used as a matrix for defining
gappy interpolation points, is now used to characterize the random
measurements of the system and produce the compressed matrix $Z \in \mathbb{R}^{p \times m}$ such that


Z = CX. 


Here, we consider sparse measurements of the snapshot matrix in order to
compute POD and DMD from this new compressed snapshot matrix. To start,
it is assumed that the snapshot matrix $X$ is almost square, e.g., $n \sim m$, and one
can imagine this is a realistic situation working with an explicit time scheme
or in a many-query context. Section 1.8 shows that the smaller matrix is now
decomposed using QR so that $Z = QR$. The original data is then projected onto
the QR basis, $Y = Q^* X$, and the SVD of the much smaller matrix is computed,
$Y = U_Y \Sigma_Y V_Y^*$. This allows one to transform the low-rank matrix $U_Y$ back to
the original coordinates and approximate the POD basis $\Psi = Q U_Y$. This
provides a computationally efficient method for extracting the POD modes. ROMs
constructed from randomized POD methods have now been investigated by
several groups, with all of them demonstrating the computational
performance advantages gained by randomization.
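
The sequence sketch, QR, projection, small SVD, and lift back can be written compactly. The version below follows the standard randomized SVD, in which the sketch is formed by multiplying the snapshot matrix on the right by a random matrix so that $Q$ has $n$ rows and $Q^* X$ is well defined; that choice, the oversampling parameter, and the synthetic rank-5 data are assumptions of this example.

```python
import numpy as np

def randomized_pod(X, r, oversample=10, seed=0):
    """Rank-r POD modes of X from a random sketch of its column space."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    Omega = rng.standard_normal((m, r + oversample))    # random test matrix
    Z = X @ Omega                                       # sketch of the column space
    Q, _ = np.linalg.qr(Z)                              # orthonormal basis for the sketch
    Y = Q.conj().T @ X                                  # project data onto that basis
    U_Y, S, _ = np.linalg.svd(Y, full_matrices=False)   # SVD of the small matrix
    Psi = Q @ U_Y                                       # lift modes back to n dimensions
    return Psi[:, :r], S[:r]

# Hypothetical rank-5 data: 2000 spatial points, 300 snapshots
rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 300))
Psi, S = randomized_pod(X, r=5)
S_exact = np.linalg.svd(X, compute_uv=False)[:5]        # reference singular values
```

For data with slowly decaying singular values, a few power iterations on the sketch are commonly added to improve accuracy; they are omitted here to keep the correspondence with the steps in the text.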

The randomized architecture can also be used to produce DMD approximations
represented by $\Phi$ [135]. Thus, instead of performing the DMD algorithm
on snapshot pairs associated with $X$, DMD is instead performed on the
compressively sampled matrices

$$Z' = A_Z Z, \tag{13.26}$$


where $Z = [\,z_1 \;\; z_2 \;\; \cdots \;\; z_m\,]$ and $Z' = [\,z'_1 \;\; z'_2 \;\; \cdots \;\; z'_m\,]$ as in Section 7.2. The DMD
algorithm computes the DMD eigenvalues and DMD modes $(\Lambda_Z, \Phi_Z)$ along
with the eigenvector matrix $W_Z$ of the similarity matrix $A_Z$. The DMD
eigenvalues are self-similar so that $\Lambda \approx \Lambda_Z$ and the DMD modes are given by
$\Phi = X' V_Z \Sigma_Z^{-1} W_Z$. Thus the computation avoids an expensive SVD by
performing the SVD in the low-rank subspace of $Z$. Like POD-based randomization,
DMD with randomization provides a scalable architecture for building
ROMs, as demonstrated by a number of groups.
Overall, randomized techniques are a promising approach to circumvent 
expensive offline computations in model order reduction. In particular, when 
dealing with large snapshot matrices, there is now abundant evidence to sug- 
gest the use of randomized SVD methods for POD- and DMD-based decompo- 
sitions. They both provide very accurate solutions and promise significant com- 
putational savings in the offline stage, which turns out to be the most expensive 
part of constructing the surrogate model. Indeed, the rapid computation of $\Psi$
and $\Phi$ through such techniques can greatly aid in solving the surrogate
models (13.25).


13.9 Machine Learning ROMs 


Inspired by machine learning methods, the various POD bases for a parameterized
system are merged into a master library of POD modes $\Psi_L$ which contains
all the low-rank subspaces exhibited by the dynamical system. This leverages
the fact that POD provides a principled way to construct an $r$-dimensional
subspace $\Psi_r$ characterizing the dynamics, while sparse sampling augments the
POD method by providing a method to evaluate the problematic nonlinear
terms using a $p$-dimensional subspace projection matrix $P$. Thus a small number
of points can be sampled to approximate the nonlinear terms in the ROM.
Figure 13.20 illustrates the library building procedure whereby a dynamical
regime is sampled in order to construct an appropriate POD basis $\Psi$.

The method introduced here capitalizes on these methods by building low- 
dimensional libraries associated with the full nonlinear system dynamics as 
well as the specific nonlinearities. Interpolation points, as will be shown in 
what follows, can be used with sparse representation and compressive sens- 
ing to (i) identify dynamical regimes, (ii) reconstruct the full state of the system, 
and (iii) provide an efficient nonlinear model reduction and Galerkin-POD
prediction for the future state.

The concept of library building of low-rank features from data is well established
in the computer science community. In the reduced-order modeling community,
it has recently become an enabling computational strategy for parametric
systems. Indeed, a variety of recent works have produced libraries of ROM
models that can be selected and/or interpolated through measurement and
classification. Alternatively, cluster-based reduced-order models use a k-means
clustering to build a Markov transition model between dynamical states [367].
These recent innovations are similar to the ideas advocated here. However, our
focus is on determining how a suitably chosen $P$ can be used across all the
libraries for POD mode selection and reconstruction. One can also build two sets
of libraries: one for the full dynamics and a second for the nonlinearity so as
to make it computationally efficient with the DEIM strategy [619]. Before these
more formal techniques based on machine learning were developed, it was already
realized that parameter domains could be decomposed into subdomains and a local
ROM/POD computed in each subdomain. Patera and co-workers used a partitioning
based on a binary tree, whereas Amsallem et al. used a Voronoi tessellation of
the domain. Such methods were closely related to the work of Du and Gunzburger,
where the data snapshots were partitioned into subsets and multiple reduced
bases computed. The multiple bases were then recombined into a single basis, so
it does not lead to a library, per se. For a review of these domain partitioning
strategies, please see [17].


Figure 13.20: Library construction from numerical simulations of the governing
equations (12.1). Simulations are performed of the parameterized system for
different values of a bifurcation parameter $\mu$. For each regime, low-dimensional
POD modes $\Psi_r$ are computed via an SVD. The various rank-$r$
truncated subspaces are stored in the library of modes matrix $\Psi_L$. This is the
learning stage of the algorithm. Reproduced from Kutz et al. [424].


POD Mode Selection 


Although there are a number of techniques for selecting the correct POD
library elements to use, including the workhorse k-means clustering algorithm
[555], one can also instead make use of sparse sampling and
the sparse representation for classification (SRC) innovations outlined in
Chapter 3 to characterize the nonlinear dynamical system [112, 136, 619]. Specifically,
the goal is to use a limited number of sensors (interpolation points) to classify
the dynamical regime of the system from a range of potential POD library
elements characterized by a parameter $\beta$. Once a correct classification is achieved,
a standard $\ell_2$ reconstruction of the full state space can be accomplished with
the selected subset of POD modes, and a Galerkin-POD prediction can be
computed for its future.

In general, we will have a sparse measurement vector $\tilde{u}$ given by (13.1). The
full-state vector $u$ can be approximated with the POD library modes ($u = \Psi_L a$),
therefore

$$\tilde{u} = P \Psi_L a, \tag{13.27}$$


where $\Psi_L$ is the low-rank matrix whose columns are POD basis vectors
concatenated across all $\beta$ regimes and $a$ is the coefficient vector giving the
projection of $u$ onto these POD modes. If $P\Psi_L$ obeys the restricted isometry property
and $u$ is sufficiently sparse in $\Psi_L$, then it is possible to solve the highly
under-determined system (13.27) with the sparsest vector $a$. Mathematically, this is
equivalent to an $\ell_0$ optimization problem, which is NP-hard. However, under
certain conditions, a sparse solution of (13.27) can be found (see Chapter 3) by
minimizing the $\ell_1$-norm instead, so that


$$a = \arg\min_{a'} \|a'\|_1 \quad \text{subject to} \quad \tilde{u} = P\Psi_L a'. \tag{13.28}$$


The last equation can be solved through standard convex optimization methods.
Thus the $\ell_1$-norm is a proxy for sparsity. Note that we use the sparsity only
for classification, not for reconstruction. Figure 13.21 demonstrates the sparse
sampling strategy and prototypical results for the sparse solution $a$.
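
One way to solve the $\ell_1$ problem with standard tools is to recast it as a linear program. The sketch below does this with `scipy.optimize.linprog` for real-valued data by splitting the unknown into nonnegative parts, and then assigns the measurement to the library block that carries the largest share of the coefficient magnitudes. The random matrix standing in for $P\Psi_L$, the block sizes, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, u_meas):
    """Solve min ||a||_1 subject to A a = u_meas as a linear program.
    Real-valued sketch: a = a_plus - a_minus with both parts nonnegative."""
    n = A.shape[1]
    cost = np.ones(2 * n)
    res = linprog(cost, A_eq=np.hstack([A, -A]), b_eq=u_meas,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

def classify(a, block_sizes):
    """Index of the library block holding the largest sum of |a_i|."""
    edges = np.cumsum([0] + list(block_sizes))
    weight = [np.sum(np.abs(a[edges[i]:edges[i + 1]]))
              for i in range(len(block_sizes))]
    return int(np.argmax(weight))

# Hypothetical library: 4 regimes with 5 modes each, sampled at 12 sensors
rng = np.random.default_rng(6)
block_sizes = [5, 5, 5, 5]
A = rng.standard_normal((12, 20))           # stands in for P Psi_L
a_true = np.zeros(20)
a_true[10:13] = [1.5, -1.0, 0.7]            # active modes lie in the third block
u_meas = A @ a_true

a_hat = basis_pursuit(A, u_meas)
regime = classify(a_hat, block_sizes)
```

Whether the sparse vector is recovered exactly depends on the properties of $P\Psi_L$ discussed above. For classification it is enough that the dominant coefficients fall in the correct block, after which the full state can be reconstructed by least squares using only that block's modes.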


Example: Flow Around a Cylinder 


To demonstrate the sparse classification and reconstruction algorithm
developed, we consider the canonical problem of flow around a cylinder. This
problem is well understood and has already been the subject of studies concerning
sparse spatial measurements [122, 136, 376, 491, 619, 732]. Specifically, it is
known that, for low to moderate Reynolds numbers, the dynamics are spatially
low-dimensional and POD approaches have been successful in quantifying the
dynamics. The Reynolds number, Re, plays the role of the bifurcation
parameter $\beta$ in (12.1), i.e., it is a parameterized dynamical system.
The data we consider comes from numerical simulations of the incompressible
Navier-Stokes equation:

$$\frac{\partial u}{\partial t} + u \cdot \nabla u + \nabla p - \frac{1}{Re}\nabla^2 u = 0, \tag{13.29}$$
$$\nabla \cdot u = 0, \tag{13.30}$$


where $u(x,y,t) \in \mathbb{R}^2$ represents the 2D velocity, and $p(x,y,t)$ the corresponding
pressure field. The boundary conditions are as follows: (i) constant flow of
$u = (1,0)^T$ at $x = -15$, i.e., the entry of the domain; (ii) constant pressure of $p = 0$
at $x = 25$, i.e., the end of the domain; and (iii) Neumann boundary conditions,
i.e., $\partial u/\partial n = 0$ on the boundary of the domain and the cylinder (centered at
$(x,y) = (0,0)$ and of radius unity).


Figure 13.21: The sparse representation for classification (SRC) algorithm for
library mode selection; see Section 3.6 for more details. In this mathematical
framework, a sparse measurement is taken of the system and a highly
under-determined system of equations $P\Psi_L a = \tilde{u}$ is solved subject to $\ell_1$
penalization so that $\|a\|_1$ is minimized. Illustrated is the selection of the $\mu$th POD
modes. The bar plot on the right depicts the non-zero values of the vector $a$,
which correspond to the $\Psi_\mu$ library elements. Note that the sampling matrix
$P$ that produces the sparse sample $\tilde{u} = Pu$ is critical for success in
classification of the correct library elements $\Psi_\mu$ and the corresponding reconstruction.
Reproduced from Kutz et al. [424].

For each relevant value of the parameter Re, we perform an SVD on the data 
matrix in order to extract POD modes. It is well known that, for relatively low 
Reynolds number, a fast decay of the singular values is observed so that only a 
few POD modes are needed to characterize the dynamics. Figure 13.22 shows
the three most dominant POD modes for Reynolds numbers Re = 40, 150, 300, 
and 1000. Note that 99% of the total energy (variance) is selected for the POD 
mode selection cut-off, giving a total of one, three, three, and nine POD modes 
to represent the dynamics in the regimes shown. For a threshold of 99.9%, more 
modes are required to account for the variability. 
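
The energy-based cut-off described above can be sketched as follows. This is an illustrative example on synthetic data with prescribed singular values (10, 1, and 0.5), not the cylinder simulations; the helper name `pod_truncate` is an assumption.

```python
import numpy as np

def pod_truncate(X, energy=0.99):
    """Return the leading POD modes, all singular values, and the rank r at which
    the cumulative energy sum(s[:r]**2) / sum(s**2) first reaches `energy`."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    r = min(int(np.searchsorted(cumulative, energy)) + 1, s.size)
    return U[:, :r], s, r

# Synthetic snapshot matrix with known singular values 10, 1, 0.5 (assumed test data).
rng = np.random.default_rng(1)
n, m = 200, 80
U0, _ = np.linalg.qr(rng.standard_normal((n, 3)))
V0, _ = np.linalg.qr(rng.standard_normal((m, 3)))
X = U0 @ np.diag([10.0, 1.0, 0.5]) @ V0.T

modes_99, s, r_99 = pod_truncate(X, energy=0.99)
_, _, r_999 = pod_truncate(X, energy=0.999)
print(r_99, r_999)
```

With these singular values the first mode alone carries about 98.8% of the energy, so the 99% threshold keeps two modes and the 99.9% threshold keeps all three, mirroring the observation that a tighter threshold requires more modes.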

Figure 13.22: Time dynamics of the pressure field (top panels) for flow around
a cylinder for Reynolds numbers Re = 40, 150, 300, and 1000. Collecting
snapshots of the dynamics reveals that low-dimensional structures dominate the
dynamics. The dominant three POD pressure modes for each Reynolds number
regime are shown (bottom panels) in polar coordinates. The pressure scale
is in magenta (bottom right). Reproduced from Kutz et al. [424].

Classification of the Reynolds number is accomplished by solving the
optimization problem (13.28) and obtaining the sparse coefficient vector a. Note
that each entry in a corresponds to the energy of a single POD mode from
our library. For simplicity, we select a number of local minima and maxima
of the POD modes as sampling locations for the matrix P. The classification
of the Reynolds number is done by summing the absolute values of the
coefficients that correspond to each Reynolds number. To account for the large
number of coefficients allocated for the higher Reynolds numbers (which may be 16
POD modes for 99.9% variance at Re = 1000, rather than a single coefficient
for Reynolds number 40), we divide by the square root of the number of POD
modes allocated in a for each Reynolds number. The classified regime is the
one that has the largest magnitude after this process.
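
The scoring rule just described can be written in a few lines. The sketch below assumes the library is ordered in contiguous blocks, one per Reynolds number, and uses the 99% mode counts quoted earlier (one, three, three, and nine); the coefficient values and the function name `classify_regime` are made up for illustration.

```python
import numpy as np

def classify_regime(a, mode_counts, labels):
    """Score each regime by the sum of |a_j| over its block of library
    coefficients, divided by the square root of the block size."""
    scores = {}
    start = 0
    for label, count in zip(labels, mode_counts):
        block = a[start:start + count]
        scores[label] = np.sum(np.abs(block)) / np.sqrt(count)
        start += count
    best = max(scores, key=scores.get)
    return best, scores

labels = [40, 150, 300, 1000]        # Reynolds numbers in the library
mode_counts = [1, 3, 3, 9]           # POD modes per regime at the 99% cut-off
a = np.zeros(sum(mode_counts))       # hypothetical sparse coefficient vector
a[1:4] = [0.9, -0.4, 0.2]            # dominant block belongs to Re = 150
a[7:10] = [0.1, -0.1, 0.05]          # small leakage into the Re = 1000 block

best, scores = classify_regime(a, mode_counts, labels)
print(best, scores)
```

Dividing by the square root of the block size keeps a regime with many library modes from winning simply because it has more coefficients to sum.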

Although the classification accuracy is high, many of the false classifications
are due to categorizing a Reynolds number from a neighboring flow, e.g.,
Reynolds number 1000 is often mistaken for Reynolds number 800. This is due
to the fact that the flows at these two Reynolds numbers are strikingly similar
and the algorithm has a difficult time separating their modal structures.
Figure 13.23 shows a schematic of the sparse sensing configuration along with the
reconstruction of the pressure field achieved at Re = 1000 with 15 sensors.
Classification and reconstruction performance can be improved using other methods
for constructing the sensing matrix P [112, 122, 136, 376, 491, 619, 732].
Regardless, this example demonstrates the usage of sparsity-promoting techniques
for POD mode selection ($\ell_1$ optimization) and subsequent reconstruction
($\ell_2$ projection).

Figure 13.23: Illustration of m sparse sensor locations (left panel) for
classification and reconstruction of the flow field. The selection of
sensory/interpolation locations can be accomplished by various algorithms
[619, 732]. For a selected algorithm, the sensing matrix P determines the
classification and reconstruction performance. Reproduced from Kutz et al. [424].

Finally, to visualize the entire sparse sensing and reconstruction process 
more carefully, Fig. 13.24 shows both the Reynolds number reconstruction for
the time-varying flow field along with the pressure field and flow field recon- 
structions at select locations in time. Note that the SRC scheme along with 
the supervised ML library provide an effective method for characterizing the 
flow strictly through sparse measurements. For higher Reynolds numbers, it 
becomes much more difficult to accurately classify the flow field with such a 
small number of sensors. However, this does not necessarily jeopardize the 
ability to reconstruct the pressure field, as many of the library elements at 
higher Reynolds numbers are fairly similar. 
Figure 13.24: Sparse-sensing Reynolds-number identification and pressure-field
reconstruction for a time-varying flow. The top panel shows the actual
Reynolds number used in the full simulation (thick solid lines) along with its
compressive sensing identification (crosses). Panels A-E show the reconstruction
of the pressure field at five different locations in time (top panel),
demonstrating a qualitatively accurate reconstruction of the pressure field. (On
the left side, the simulated pressure field is presented, while the right side
contains the reconstruction.) Note that, for higher Reynolds numbers, the
classification becomes more difficult. Reproduced from Bright et al. [112].
Homework


Exercise 13-1. Consider the three functions: 


$$f(x) = x \exp(-x^2), \tag{13.31}$$
$$f(x) = \exp[-(x - 0.5)^2] + 3\exp[-2(x + 3/2)^2], \tag{13.32}$$
$$f(x) = \sin(\pi x/8) \cos(\pi x/4), \tag{13.33}$$


on the interval $x \in [-4, 4]$. Approximate each function with $n \gg 1$ points and
use r random point measurements to reconstruct the function using (i) the
first r Gauss-Hermite functions, and (ii) the first r/2 cosine and sine modes
$\cos(n\pi x/L)$ and $\sin(n\pi x/L)$, where $n = 1, 2, \ldots, r/2$. For the reconstruction,
produce the least-squares error as a function of the rank r. Ensemble the results by
considering a large number of random point measurements to produce a mean 
and variance of the error statistics. 


Repeat the experiment above but use the QR algorithm and the DEIM algorithm
to compute the r point measurement locations. Compare the
error to the statistical distribution of errors for random measurement locations. 


Exercise 13-2. Train a decoder network that maps high-fidelity, well-resolved 
solutions for the Kuramoto-Sivashinsky (KS) equation in a parameter regime 
where spatio-temporal chaos is exhibited to randomly chosen point measure- 
ments of the system. With test data, evaluate the performance of the decoder as 
a function of the r point measurements. Also evaluate statistically the stability 
of the decoder for reconstruction as a function of the random point measure- 
ment locations. 


Exercise 13-3. Consider the nonlinear Schrödinger equation solver with DEIM 
and QDEIM integration. Repeat the experiment of constructing a ROM using 
r random measurements. Determine the value of r for which the ROM model 
gives similar performance to DEIM and QDEIM with high probability. 
Chapter 14


Physics-Informed Machine Learning 


In this chapter, many of the critically enabling aspects of machine learning are 
brought together for modeling problems in science and engineering. Specifi- 
cally, the goal of this chapter is to highlight a number of methods that have 
been recently developed under the aegis of physics-informed machine learn- 
ing. The process of machine learning may be broken down into a number of key 
stages, each of which provides an opportunity to embed or enforce prior phys- 
ical knowledge: (1) formulating a problem to model, (2) collecting and curating 
the data used to train the model, (3) choosing an architecture to best represent 
or model the data, (4) designing a loss function to assess the performance of the 
model and guide the learning process, and (5) selecting and implementing an 
optimization algorithm to train the model to minimize the loss function over 
the training data. In the following, several of these physics-informed machine 
learning strategies will be investigated. 

These techniques often involve the training of neural networks in the over- 
all workflow. The neural networks will be denoted by 


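$$\mathbf{f}_{\boldsymbol{\theta}}(\mathbf{x}), \tag{14.1}$$

where $\boldsymbol{\theta}$ are the neural network weights and $\mathbf{f}(\cdot)$ characterizes the network
architecture (number of layers, structure, regularizers). The weights $\boldsymbol{\theta}$ are then
optimized to minimize a loss function over training data $\mathbf{X}$:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}, \mathbf{X}). \tag{14.2}$$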


The various physics-informed networks highlight interesting structures and 
constraints imposed on the model $\mathbf{f}_{\boldsymbol{\theta}}$ and loss function $\mathcal{L}$. Often multiple neural
networks are trained simultaneously in order to exploit a structural constraint 
of a spatio-temporal system. Moreover, targeting the use of deep learning is 
important to retain interpretability and explainability of models in the discov- 
ery process. This is not an exhaustive survey, but rather a targeted exploration 
of some methods that have had broad appeal in the community due to their 
effectiveness, ease of use and/or interpretability. More details can be found in
recent reviews [224, 111, 131, 375].

Importantly, parsimony has long been a guiding principle in physical mod- 
eling, favoring the simplest model that describes the data to avoid overfitting 
and promote generalization. This principle of parsimony has been central in 
physics for centuries, from Aristotle to Einstein. In modern machine learning, 
parsimony is still a guiding principle for interpretable and generalizable
models, and it may be enforced through (i) a low-dimensional coordinate system,
(ii) a sparse representation of governing equations, or (iii) the capture of
parametric dependencies.


14.1 Mathematical Foundations 


Data-driven models are an emerging and important paradigm for science and 
engineering. They also provide the foundational mathematical framing required 
for virtual instantiations of physical systems, i.e., the digital twin. Specifically, 
digital twins integrate with Kalman filtering architectures, which together pro- 
duce predictions that are a combination of models and data. These data-driven 
discovery tools are achieved using simple regression techniques outlined in 
previous chapters that can often lead to improved interpretable and general- 
izable models. Although DNN architectures are used to learn physics, there 
remain critical issues concerning generalizability, interpretability, overfitting, 
and significant data requirements, limiting their usefulness and computational 
tractability for data-driven models. Regardless of the method used, such models are
compromised in practice by limited data, corruption due to noise, unmeasured
latent variables, parametric dependences, and unaccounted-for physics. The 
targeted use of data-driven techniques provides a structure for model reduc- 
tion, much like autoencoder structures, which facilitates a rapid and adaptive 
ROM construction paradigm with a diversity of potential methods for learning 
time evolution (DMD, Koopman, SINDy, etc.). 

An overarching goal in data-driven modeling is to leverage machine learn- 
ing algorithms to learn physically interpretable and generalizable models of 
spatio-temporal systems from offline and/or online streaming data. There are 
three critical scenarios to consider, corresponding to when the baseline physical 
model, or parametric form, is known, unknown, or only partially known. Thus 
we seek to perform system identification from data $\mathbf{y} \in \mathbb{R}^p$ to learn a model
in a high-dimensional state space $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^n$, where $n \gg 1$; often $p < n$.
Specifically,

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}, t, \boldsymbol{\theta}, \mathbf{w}_d), \tag{14.3a}$$
$$\mathbf{y} = \mathbf{h}(t, \mathbf{x}(t)) + \mathbf{w}_n, \tag{14.3b}$$

where the dynamics are prescribed by $\mathbf{f}: \mathcal{X} \to \mathbb{R}^n$, and the observation
operator is $\mathbf{h}: \mathcal{X} \to \mathbb{R}^p$. The measurements typically occur at discrete times $t_k$,
in which case they are denoted by $\mathbf{y}_k$. Observations are compromised by
measurement noise $\mathbf{w}_n$, which is typically described by a probability distribution
(e.g., a normal distribution $\mathbf{w}_n \sim \mathcal{N}(\mu, \sigma)$). The dynamics are prescribed by a
set of parameters $\boldsymbol{\theta}$. Moreover, the dynamics may be subject to stochastic effects
characterized by $\mathbf{w}_d$.

The goal is as follows: Given m measurements $\mathbf{y}_k$ arranged in the matrix
$\mathbf{Y} = [\mathbf{y}_1 \ \mathbf{y}_2 \ \cdots \ \mathbf{y}_m] \in \mathbb{R}^{p \times m}$, infer the dynamics $\mathbf{f}(\cdot)$ (or unknown portion of the
dynamics) with parameterization $\boldsymbol{\theta}$, the measurement operator $\mathbf{h}(\cdot)$, or a proxy
model of the true system, so that tasks such as control and forecasting can be
accomplished. Adding to the difficulty of the task are multi-scale and
multi-physics phenomena. Even the simplest multi-scale system can challenge many
data-driven methodologies. For example, even a simple two-scale system makes it
difficult to extract the governing equations, which are modified to:

$$\frac{d\mathbf{x}_1}{dt} = \mathbf{f}_1(\mathbf{x}_1, \mathbf{x}_2, t, \tau, \boldsymbol{\theta}_1, \mathbf{w}_{d1}) \quad \text{and} \quad \frac{d\mathbf{x}_2}{d\tau} = \mathbf{f}_2(\mathbf{x}_1, \mathbf{x}_2, t, \tau, \boldsymbol{\theta}_2, \mathbf{w}_{d2}), \tag{14.4}$$

where $\tau = \epsilon t$ (with $\epsilon \ll 1$) is a slow scale.

If $\mathbf{h}(\cdot)$ is not the identity and/or $\mathbf{w}_n$ is not zero, then we have imperfect data.
In general, inferring f(x) is an ill-posed problem whose solution must be ac- 
complished through judiciously chosen regularization. In the case of (14.4), this 
is accomplished through first decomposing the data into its constitutive fast 
and slow timescales. 

Solving the ill-posed problem is a fundamental scientific and mathematical
challenge. To date, it has only been accomplished in highly specialized
settings, typically with full-state measurements and high-quality (low-noise)
data. Significant mathematical innovations are still required in order to
make this a general and robust architecture. The multi-scale equation (14.4)
remains exceptionally challenging since it requires the integration of a broad set
of mathematical tools.


Sensors and Limited Data 


Everything starts with data acquisition. This is often overlooked in machine 
learning methods, which assume that access to the correct variables is
available. Thus the mapping $\mathbf{h}(\cdot)$ and its inverse are important to learn. For many
complex systems, the latent variable space is an important aspect of the dis- 
covery process. For instance, time-delay embeddings [126], and recourse to 
Takens' theorem, help establish a critical connection to dynamical systems
theory and a potential reconstruction of the full state space with limited
measurements. There are four critical aspects to developing a robust sensing
framework: (i) sensor placement, (ii) sensor cost, (iii) discovery of the measurement
map $\mathbf{h}(\cdot)$, and (iv) multi-modal data integration from diverse sensor types (video,
audio, thermal, etc.).
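
To make the time-delay embedding mentioned above concrete, the following minimal sketch stacks time-shifted copies of a single measured variable into a Hankel matrix and inspects its singular values. The helper name `hankel_matrix`, the sinusoidal test signal, and the number of delays are illustrative assumptions.

```python
import numpy as np

def hankel_matrix(y, n_delays):
    """Stack n_delays time-shifted copies of a scalar series y as rows."""
    y = np.asarray(y)
    n_cols = y.size - n_delays + 1
    return np.stack([y[i:i + n_cols] for i in range(n_delays)])

t = np.linspace(0, 20 * np.pi, 2000)
y = np.sin(t)                          # one measured variable of an oscillator
H = hankel_matrix(y, n_delays=50)      # each column is a delay-coordinate vector

s = np.linalg.svd(H, compute_uv=False)
print(H.shape, s[:4] / s[0])
```

For this signal only two singular values are significantly non-zero, consistent with a two-dimensional latent state (for example, position and velocity) being recoverable from a single measured variable.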

Recent efforts have established some of the earliest rigorous mathematical 
results on formulating optimal sensor placement and minimal cost strategies 
for complex spatio-temporal systems [481]. New fundamental mathematical 
innovations are required to identify near optimal sensor locations for systems 
with nonlinear manifold embeddings, which are typical of real data. DNNs can 
be used for decoder networks capable of producing a highly improved map- 
ping between the data and the underlying state space [235]. It would also be ad- 
vantageous to use the time-delay embedding structure to try and reconstruct, 
as best as possible, the latent variables and to reframe the greedy algorithms 
based upon the time-delay data. To date, it is unknown what the limits and 
mathematical possibilities are for using such a method to extract the full-state 
variable x from measurements y. In addition to extracting critical information 
on the state space, DNNs have been recently shown to be capable of de-noising 
data sets in a manner that is comparable to, and in many cases better than, 
Kalman filtering methods. Multi-modal data fusion techniques may further improve
these results, as well as the resulting decision making or predictions.
Sensors are critical for determining $\mathbf{h}(\cdot)$, and
targeted use of DNNs suggests robust architectures for making the best 
use of data collected for model discovery. 


Coordinate Discovery and Data Representation 


Data processed from the multi-modal sensors can be used to discover a transformation
$\mathbf{z} = \mathbf{g}(\mathbf{x})$ for parsimonious, low-dimensional dynamics [168, 465]

$$\dot{\mathbf{z}} = \mathbf{F}(\mathbf{z}, t, \boldsymbol{\theta}, \mathbf{w}_d), \tag{14.5}$$

where $\mathbf{z} \in \mathcal{Z} \subseteq \mathbb{R}^r$ is an r-dimensional ($r \ll n$) model of the physics specified
by $\mathbf{F}(\cdot)$. Ultimately, the discovery of the nonlinear transform $\mathbf{g}(\cdot)$, through
training neural network autoencoders, gives the coordinates for parsimonious
dynamics $\mathbf{F}(\cdot)$.

If only limited data is available, then it may be required to produce a
low-fidelity, online model using a linear map. This can be done with a rank-r
SVD/POD mode truncation of the snapshots of x. Dynamic mode decomposition,
or a Koopman approximation using augmented state-space measurements,
can then be used on this low-rank subspace to produce the best-fit linear
model through the data. This provides a baseline architecture for diagnostics
and forecasting. As more data is acquired, a full nonlinear mapping and
nonlinear model can be used to refine the results, both in terms of building
a lower-rank nonlinear subspace and for producing parsimonious nonlinear
dynamics. As sufficient data is acquired from the sensors, the data-discovery
pipeline then produces the flow


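$$\mathbf{y} \in \mathbb{R}^p \ \text{(measurements)} \ \longrightarrow \ \mathbf{x} \in \mathbb{R}^n \ \text{(state space)} \ \longrightarrow \ \mathbf{z} \in \mathbb{R}^r \ \text{(ROM)} \tag{14.6}$$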


with two mappings to discover, h and g. With limited data, SVD provides a lin- 
ear approximation. Refinement of the linear approximation can be learned over 
time using architectures where the identity (linear) mapping is the
leading-order approximation [279, 544].
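
The linear baseline described above (a rank-r SVD/POD truncation followed by a best-fit linear model on the reduced coordinates) can be sketched as below. This is a bare-bones DMD-style regression on synthetic data generated by a decaying rotation embedded in 50 dimensions; the function name `dmd_linear_model` and all parameter values are assumptions for illustration.

```python
import numpy as np

def dmd_linear_model(X, r):
    """Best-fit linear map on the leading r POD coordinates, z_{k+1} ~ A_r z_k,
    computed from snapshot pairs (columns of X at consecutive times)."""
    X1, X2 = X[:, :-1], X[:, 1:]
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U_r, s_r, V_r = U[:, :r], s[:r], Vh[:r].conj().T
    A_r = U_r.conj().T @ X2 @ V_r / s_r      # U^* X2 V Sigma^{-1}
    return U_r, A_r

# Synthetic data: a slowly decaying rotation in a 2D subspace of R^50.
rng = np.random.default_rng(4)
n, m, rho, theta = 50, 100, 0.98, 0.3
B = rho * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
Q, _ = np.linalg.qr(rng.standard_normal((n, 2)))
Z = np.empty((2, m))
Z[:, 0] = [1.0, 0.0]
for k in range(m - 1):
    Z[:, k + 1] = B @ Z[:, k]
X = Q @ Z                                    # high-dimensional snapshots

U_r, A_r = dmd_linear_model(X, r=2)
eigs = np.linalg.eigvals(A_r)
print(np.abs(eigs), np.abs(np.angle(eigs)))
```

The eigenvalues of the reduced operator recover the decay rate and rotation angle of the underlying two-dimensional dynamics, which is the sense in which the linear model provides a baseline for diagnostics and forecasting.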


Overall, solving the ill-posed problem is the central aim of physics-informed 
machine learning. In what follows, a diversity of techniques are presented which 
leverage the data to build advantageous models (neural networks, for instance) 
that can be used for forecasting and characterization. 


14.2 SINDy Autoencoder: Coordinates and Dynamics


In this first vignette on physics-informed machine learning, we explore an
architecture capable of jointly and simultaneously learning coordinates and
parsimonious dynamics. Specifically, Champion et al. [168] present a method for
the simultaneous discovery of sparse dynamical models (SINDy) and coordinates
(autoencoders) that enable these simple representations. The aim in the
architecture is to leverage the parsimony and interpretability of SINDy with
the universal approximation capabilities of deep neural networks to discover
an appropriate coordinate system in which to embed the dynamics. This can
produce interpretable and generalizable models capable of extrapolation and
forecasting, since the dynamical model is minimally parameterized. The
architecture is shown in Fig. 14.1, where an autoencoder is used to embed the
original data x into a new coordinate z amenable to a parsimonious representation.
While in the original coordinate system a dynamical model may be dense in
terms of functions of the original measurement coordinates x, this method
determines through an autoencoder a reduced coordinate system
$\mathbf{z}(t) = \varphi(\mathbf{x}(t)) \in \mathbb{R}^r$ ($r < n$) where the following dynamical model holds:

$$\frac{d}{dt}\mathbf{z}(t) = \mathbf{g}(\mathbf{z}(t)). \tag{14.7}$$

Specifically, a parsimonious description of the dynamics is sought where g
contains only a few active terms from a SINDy library. Thus, in addition to a
dynamical model, the method learns coordinate transforms $\varphi$ and $\psi$ that map the
measurements to intrinsic coordinates via $\mathbf{z} = \varphi(\mathbf{x})$ (encoder) and back via
$\mathbf{x} \approx \psi(\mathbf{z})$ (decoder).
Figure 14.1: Schematic of the SINDy autoencoder method for simultaneous
discovery of coordinates and parsimonious dynamics. (a) An autoencoder
architecture is used to discover intrinsic coordinates z from high-dimensional
input data x. The network consists of two components: an encoder $\varphi(\mathbf{x})$, which
maps the input data to the intrinsic coordinates z, and a decoder $\psi(\mathbf{z})$, which
reconstructs x from the intrinsic coordinates. (b) A SINDy model captures the
dynamics of the intrinsic coordinates. The active terms in the dynamics are
identified by the non-zero elements in $\boldsymbol{\Xi}$, which are learned as part of the NN
training. The time derivatives of z are calculated using the derivatives of x and
the gradient of the encoder $\varphi$. The bottom panel shows the pointwise loss
function used to train the network. The loss function encourages the network to
minimize both the autoencoder reconstruction error and the SINDy loss in z
and x. Also, $L_1$ regularization on $\boldsymbol{\Xi}$ is included to encourage parsimonious
dynamics. From Champion et al. [168].


The autoencoder is a flexible, feedforward neural network that allows one 
to discover underlying low-dimensional coordinates in which to represent the 
data. Thus the layers of the autoencoder learn a latent representation of a new 
variable in which to express the data, in this case the evolution dynamics. Of- 
ten an autoencoder is used for classification and prediction. However, here, 
its targeted use is for learning a new coordinate system. Section 6.8 highlights
the use of an autoencoder to discover a low-dimensional modal embedding 
for flow around a cylinder. The network is trained to output an approximate 
reconstruction of its input, and the restrictions placed on the network archi- 
tecture (e.g., the type, number, and size of the hidden layers) characterize the 
intrinsic coordinates [290]. The autoencoder gives a nonlinear generalization of
principal component analysis (PCA) [46].

In many science and engineering applications, the goal is to determine the 
underlying intrinsic coordinate system that best characterizes the data. For in- 
stance, in celestial mechanics, it took one and a half millennia to discover that 
a heliocentric coordinate system was a more appropriate choice of coordinates. 
This quickly led to the discovery of the F = ma model of gravitation. Indeed, 
this pairing of coordinates and model is exactly what the SINDy autoencoder 
attempts to automate. In practice, it is common to discover intrinsic coordi- 
nates z that are much lower in dimension than the original state-space obser- 
vations x. The autoencoder learns a nonlinear embedding from measurement 
data $\mathbf{x}(t) \in \mathbb{R}^n$ to an intrinsic coordinate $\mathbf{z}(t) \in \mathbb{R}^r$, where $r < n$ is chosen as a
hyperparameter prior to training the network. 

Autoencoders can learn a low-dimensional representation in isolation without
need to specify any other constraints. This is exactly what was done in
Section 6.8 to embed fluid flow in a nonlinear coordinate system. Without further
specifications, the intrinsic coordinates learned have no particular meaning or 
interpretation. However, if, in the latent space, additional constraints are im- 
posed, then additional structure and meaning can be imposed on the model. 
For the SINDy autoencoder model, the network is required to learn coordinates 
associated with parsimonious dynamics. Thus it integrates the sparse regres- 
sion framework of SINDy in the latent space, or intrinsic coordinates z. This 
constraint in the autoencoder provides a regularization framework whereby 
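model discovery is achieved by constructing a library
$\boldsymbol{\Theta}(\mathbf{z}) = [\theta_1(\mathbf{z}), \theta_2(\mathbf{z}), \ldots, \theta_p(\mathbf{z})]$
of candidate basis functions, e.g., polynomials, and learning a sparse set of
coefficients $\boldsymbol{\Xi} = [\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_r]$ that defines the dynamical system

$$\frac{d}{dt}\mathbf{z}(t) = \mathbf{g}(\mathbf{z}(t)) = \boldsymbol{\Theta}(\mathbf{z}(t))\boldsymbol{\Xi}.$$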

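Typical of SINDy, the library is specified before training occurs, whereas the
library loadings (coefficients) $\boldsymbol{\Xi}$ are learned along with the autoencoder weights
during training (optimization). Importantly, the derivatives $\dot{\mathbf{x}}(t)$ of the original
states are computed in order to pass these along to the encoder variables as
$\dot{\mathbf{z}}(t) = \nabla_{\mathbf{x}}\varphi(\mathbf{x}(t))\,\dot{\mathbf{x}}(t)$. This helps enforce accurate prediction of the dynamics
by incorporating the loss function:

$$\mathcal{L}_{d\mathbf{z}/dt} = \left\| \nabla_{\mathbf{x}}\varphi(\mathbf{x})\,\dot{\mathbf{x}} - \boldsymbol{\Theta}(\varphi(\mathbf{x})^T)\boldsymbol{\Xi} \right\|_2^2. \tag{14.8}$$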
This term uses both the typical SINDy regression along with the gradient of
the encoder to promote learning of a sparse dynamical model which accurately
predicts the time derivatives of the encoder variables. Additional loss terms
require that the SINDy predictions accurately reconstruct the time derivatives
of the original data:

$$\mathcal{L}_{d\mathbf{x}/dt} = \left\| \dot{\mathbf{x}} - (\nabla_{\mathbf{z}}\psi(\mathbf{z}))\,(\boldsymbol{\Theta}(\mathbf{z}^T)\boldsymbol{\Xi}) \right\|_2^2. \tag{14.9}$$

These loss terms (14.8) and (14.9) are added to the standard autoencoder loss

$$\mathcal{L}_{\text{recon}} = \left\| \mathbf{x} - \psi(\varphi(\mathbf{x})) \right\|_2^2,$$


which ensures that the autoencoder can accurately reconstruct the original
input data. To help promote sparsity in the SINDy architecture, an $\ell_1$
regularization penalty $\mathcal{L}_{\text{reg}}$ is included on the SINDy coefficients $\boldsymbol{\Xi}$. This promotes a
parsimonious model for the dynamics by selecting only a small number of terms. The
combination of the above four terms gives the following overall loss function:

$$\mathcal{L}_{\text{recon}} + \lambda_1 \mathcal{L}_{d\mathbf{x}/dt} + \lambda_2 \mathcal{L}_{d\mathbf{z}/dt} + \lambda_3 \mathcal{L}_{\text{reg}},$$

where the hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ determine the relative weighting of
the last three terms in the loss function.
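
To see how the four terms fit together, the sketch below evaluates them with NumPy for the simplest possible case: a linear encoder and decoder, so that the encoder and decoder Jacobians are just the weight matrices. This is only an illustration under that assumption; the method described above uses nonlinear networks trained by gradient-based optimization. The function names, the small polynomial library, and the synthetic data are all hypothetical.

```python
import numpy as np

def library(Z):
    """Theta(z): constant, linear, and elementwise quadratic candidate terms."""
    return np.hstack([np.ones((Z.shape[0], 1)), Z, Z**2])

def mean_sq(E):
    """Mean over samples (rows) of the squared 2-norm."""
    return np.mean(np.sum(E**2, axis=1))

def sindy_autoencoder_loss(X, Xdot, W_enc, W_dec, Xi, lambdas=(1.0, 1.0, 1.0)):
    """Evaluate the four loss terms for a linear encoder z = W_enc x and
    decoder x = W_dec z. Rows of X and Xdot are samples."""
    Z = X @ W_enc.T                    # encoder output
    Zdot = Xdot @ W_enc.T              # encoder Jacobian (W_enc) applied to xdot
    Zdot_model = library(Z) @ Xi       # SINDy prediction Theta(z) Xi
    terms = {
        "recon": mean_sq(X - Z @ W_dec.T),
        "dxdt": mean_sq(Xdot - Zdot_model @ W_dec.T),
        "dzdt": mean_sq(Zdot - Zdot_model),
        "reg": np.sum(np.abs(Xi)),
    }
    l1, l2, l3 = lambdas
    total = terms["recon"] + l1 * terms["dxdt"] + l2 * terms["dzdt"] + l3 * terms["reg"]
    return total, terms

# Synthetic check: linear latent dynamics zdot = A z embedded in R^30.
rng = np.random.default_rng(2)
n, r, m = 30, 2, 400
A = np.array([[-0.1, 2.0], [-2.0, -0.1]])
Z_true = rng.standard_normal((m, r))
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
X, Xdot = Z_true @ Q.T, (Z_true @ A.T) @ Q.T

Xi = np.zeros((1 + 2 * r, r))
Xi[1:1 + r, :] = A.T                   # only the linear library terms are active
total, terms = sindy_autoencoder_loss(X, Xdot, W_enc=Q.T, W_dec=Q, Xi=Xi,
                                      lambdas=(1.0, 1.0, 1e-3))
print(total, terms)
```

With the exact encoder, decoder, and coefficients supplied, the reconstruction and derivative terms vanish to rounding error and only the weighted $\ell_1$ penalty remains, which is the balance the training procedure seeks.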

In addition to the $\ell_1$ regularization, sequential thresholding has been shown
to be an effective proxy for the $\ell_0$-norm [775]. This technique is inspired by
the original algorithm used for SINDy [132], which combined least-squares
fitting with sequential thresholding to obtain a sparse model. Thresholding is
applied at fixed intervals throughout the training, with all coefficients below
the threshold being set to zero and training resuming using only the terms
left in the model. The Adam optimizer provides a robust framework for
the optimization procedure. In addition to the loss function weightings and
SINDy coefficient threshold, training requires the choice of several other
hyperparameters, including learning rate, number of intrinsic coordinates r, network
size, and activation functions [168]. Figure 14.2 shows the SINDy autoencoder
method applied to a video of a pendulum. From the video, it is able to learn
the underlying variables that characterize the pendulum motion, i.e., the
angle and its angular velocity. These latent state-space variables are learned by
enforcing SINDy, thus producing the coordinates and dynamics $\ddot{z} = -\sin z$ of
the model correctly. This framework allows for going straight from videos to
physics discovery models.
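
The least-squares-plus-thresholding idea from the original SINDy algorithm can be sketched directly. Note that this is the stand-alone regression version on exact synthetic derivatives of a pendulum-like system, not the variant interleaved with neural network training described above; the function name `stlsq`, the library, and the threshold value are assumptions.

```python
import numpy as np

def stlsq(Theta, dXdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero out small coefficients,
    and refit using only the remaining library terms."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dXdt.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                Xi[keep, j] = np.linalg.lstsq(Theta[:, keep], dXdt[:, j], rcond=None)[0]
    return Xi

# Synthetic samples of a pendulum state (angle z, angular velocity v).
rng = np.random.default_rng(3)
z = rng.uniform(-3, 3, 300)
v = rng.uniform(-2, 2, 300)
Theta = np.column_stack([np.ones_like(z), z, v, np.sin(z), z**2, v**2])
dXdt = np.column_stack([v, -np.sin(z)])     # dz/dt = v, dv/dt = -sin z

Xi = stlsq(Theta, dXdt, threshold=0.1)
print(np.round(Xi, 3))
```

Only two coefficients survive: the `v` term in the first equation and the `sin(z)` term in the second, which together are equivalent to the second-order pendulum model quoted in the text.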

The basic architecture developed by Champion et al. [168] is highly flexible.
Indeed, it has already been illustrated in Section 7.4 to encode a Koopman
operator [465]. Figure 14.3 shows how it can also be used to embed the dynamics
into its normal-form dynamics near instabilities [369]. Normal forms are excep- 
tional representations of the dynamics, as they capture the underlying intrinsic 
behavior with minimal parameterization. They also highlight the nature of un- 
derlying instabilities, which is a critical component for understanding pattern 
forming systems, for instance. 
[Figure panel: Nonlinear pendulum, discovered model $\ddot{z} = -0.99 \sin z$]

Figure 14.2: Illustration of the SINDy autoencoder method whereby
high-dimensional input data $\mathbf{x}(t) \in \mathbb{R}^n$ (video of a pendulum) is used as the input
data stream. The autoencoder constructs a low-rank embedding and enforces
SINDy. As such, it discovers a latent variable space $\mathbf{z}(t) \in \mathbb{R}^r$ where $r = 2$.
Specifically, it discovers the angle of the pendulum ($z$) and its angular velocity
$\dot{z}$. SINDy then learns (approximates) the underlying dynamics $\ddot{z} = -\sin z$.


14.3 Koopman Forecasting 


Dynamic mode decomposition and Koopman theory have already been in- 
troduced in previous chapters. Highlighted here is an extension of Koopman 
theory whereby neural networks transform time-series data into a form more 
amenable to a Koopman representation [428]. Thus, instead of transforming the 
spatial coordinate system as in the last section, here a transformation of time is 
learned, whereby the temporal evolution is made to be as sinusoidal as possi- 
ble. In its simplest form, the underlying optimization is given by 


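$$\arg\min_{\mathbf{A}, \mathbf{B}} \sum_k (\mathbf{x}_k - \mathbf{A}\mathbf{z}_k)^2 \quad \text{subject to} \quad \mathbf{z}_{k+1} = \mathbf{B}\mathbf{z}_k, \tag{14.10}$$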
for data snapshots $\mathbf{x}_k$, model snapshots $\mathbf{z}_k$, and $k = 1, 2, \ldots, m$. This
optimization framework is similar to DMD [422], which regresses to a best-fit linear
model. However, in this formulation, a linear dynamic model B is learned
along with a linear mapping A to the data, allowing the mapping to be
generalized to a neural network embedding. It is assumed that the data is collected
over a time frame $t \in [0, T]$. In the context of forecasting, it is typically
advantageous to produce long-term forecasts of a system, which would further require
enforcing $\Re\{\mathrm{Eig}(\mathbf{B})\} = 0$. Such a constraint guarantees that the solutions do not
decay to zero or grow to infinity. Enforcing this constraint allows us to rewrite
the original optimization problem as

$$\arg\min_{\boldsymbol{\omega}, \mathbf{A}} \sum_k \left( \mathbf{x}_k - \mathbf{A} \begin{bmatrix} \sin(\omega_1 t_k) \\ \vdots \\ \sin(\omega_N t_k) \\ \cos(\omega_1 t_k) \\ \vdots \\ \cos(\omega_N t_k) \end{bmatrix} \right)^2 = \arg\min_{\boldsymbol{\omega}, \mathbf{A}} \sum_k \left( \mathbf{x}_k - \mathbf{A}\,\boldsymbol{\Omega}(\boldsymbol{\omega} t_k) \right)^2, \tag{14.11}$$

where the model fit is to N distinct frequencies. This is a constrained version
of Koopman. Koopman generically fits to exponentials, allowing for real and
imaginary parts of the eigenvalues. This fitting procedure is restricted
to the imaginary part, which, as will be shown, allows for an exceptional
forecasting tool for data that is periodic or quasi-periodic in nature.

Figure 14.3: Instabilities lead to canonical pattern formation in various physical
systems that are characterized by underlying normal forms (panels: saddle-node,
pitchfork, transcritical, and Hopf). The autoencoder
collapses data to the underlying normal-form coordinates $(z, \beta)$, with
bifurcation parameter $\beta$. The dynamics on the reduced coordinates $(z, \beta)$ are given by
normal-form equations, which are typically given by four different canonical
forms. From Kalia et al. [369].

An obvious connection to make is with the Fourier transform and, more
specifically, its computational engine, the fast Fourier transform (FFT). The
FFT also transforms a given time series into a frequency representation, but it
constructs that representation from frequencies that are periodic on the time
interval t ∈ [0, T]. This is problematic for signals that display only a
fraction of a period. Specifically, the periodic continuation enforced by the
FFT produces the Gibbs phenomenon, so many artificial high frequencies are
generated because the solution is forced to be periodic on t ∈ [0, T]. This
makes forecasting with the FFT difficult and inaccurate unless the data is
periodic and sampled over a whole number of periods. The regression (14.11)
provides a more general framework, as the frequencies are determined during
optimization and no underlying periodicity on the time interval t ∈ [0, T] is
assumed. As a consequence, a non-convex optimization must be performed, which
is often detrimental for gradient descent methods, since they can get stuck in
local minima. To overcome the issues with non-convex optimization, the FFT is
used to seed the gradient descent algorithm used for (14.11), thus providing a
more stable algorithm for the frequency fitting procedure. This allows the
global properties of the FFT to inform the local frequencies that are then
optimized by gradient descent [428].
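
The seeding idea can be illustrated with a minimal sketch, under several simplifying assumptions that are not from Lange et al.: a single frequency (N = 1), a noise-free signal, a plain DFT written out in pure Python in place of an FFT library, and a golden-section search in place of gradient descent for the local refinement. All function names are hypothetical.

```python
import cmath
import math

def dft_peak_frequency(x, dt):
    """Angular frequency of the largest DFT bin (plain O(n^2) DFT; an FFT yields the same bins)."""
    n = len(x)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return 2.0 * math.pi * best_k / (n * dt)

def misfit(x, t, w):
    """Squared error of the best fit a*sin(w t) + b*cos(w t); a and b solve the 2x2 normal equations."""
    s = [math.sin(w * tk) for tk in t]
    c = [math.cos(w * tk) for tk in t]
    ss = sum(v * v for v in s)
    cc = sum(v * v for v in c)
    sc = sum(u * v for u, v in zip(s, c))
    xs = sum(u * v for u, v in zip(x, s))
    xc = sum(u * v for u, v in zip(x, c))
    det = ss * cc - sc * sc
    a = (xs * cc - xc * sc) / det
    b = (xc * ss - xs * sc) / det
    return sum((xk - a * sk - b * ck) ** 2 for xk, sk, ck in zip(x, s, c))

def refine_frequency(x, t, w_seed, half_width, iterations=60):
    """Golden-section search for the misfit minimum within half a DFT bin of the seed."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = w_seed - half_width, w_seed + half_width
    for _ in range(iterations):
        w1 = hi - ratio * (hi - lo)
        w2 = lo + ratio * (hi - lo)
        if misfit(x, t, w1) < misfit(x, t, w2):
            hi = w2
        else:
            lo = w1
    return 0.5 * (lo + hi)

# The signal is not periodic on [0, T]: it covers about 4.14 periods.
dt, n, w_true = 0.05, 400, 1.3
t = [k * dt for k in range(n)]
x = [math.sin(w_true * tk) for tk in t]

w_seed = dft_peak_frequency(x, dt)  # restricted to the grid 2*pi*k/T
w_fit = refine_frequency(x, t, w_seed, half_width=math.pi / (n * dt))
print(w_seed, w_fit)
```

The DFT can only return a multiple of 2π/T, so the seed is off by a fraction of a bin; the local search then moves the frequency off the grid. Restricting the search to half a bin on either side of the seed is the design choice that keeps the non-convex problem locally well behaved in this toy setting.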

In addition to generalizing the FFT, Lange et al. go further and warp the 
time-series data to be more amenable to Fourier analysis. Specifically, neural 
networks are used to transform data from its original form into data that is 
more sinusoidal in nature with as few frequencies as possible. This is done by 
replacing the linear operator A in (14.11) with a nonlinear (neural network) 
Table 14.1: Performance in long-term forecasting of distribution-level energy
consumption as measured by the relative cumulative error for various
algorithms. Note that long-term predictions are obtained by recursively feeding
predictions back into algorithms where applicable. Furthermore, the column
“Patterns” indicates whether the algorithms at hand have successfully
extracted daily (D), weekly (W) or yearly (Y) patterns.

Algorithm                          | Forecast horizon            | Patterns
                                   | 25%  | 50%  | 75%  | 100%   | D   | W   | Y
Koopman forecast                   | 0.19 | 0.21 | 0.19 | 0.19   | ✓   | ✓   | ✓
Fourier forecast                   | 0.31 | 0.39 | 0.33 | 0.30   | ✓   | ✓   | ✓
LSTM                               | 0.37 | 0.4  | 0.42 | 0.45   | ✓   | ✗   | ✗
GRU                                | 0.53 | 0.55 | 0.52 | 0.50   | ✓   | ✗   | ✗
Echo state network                 | 0.67 | 0.73 | 0.76 | 0.73   | ✓   | ✗   | ✗
AR(1, 12, 24, 168, 4380, 8760)     | 0.75 | 0.95 | 1.07 | 1.13   | ✓   | ✓   | ✓
CW-RNN (data clocks)               | 1.10 | 1.14 | 1.14 | 1.15   | (✓) | ✗   | ✗
CW-RNN                             | 1.05 | 1.08 | 1.08 | 1.09   | (✓) | ✗   | ✗
AutoARIMA                          | 0.83 | 1.11 | 1.18 | 1.26   | ✗   | ✗   | ✗
Temporal convolutional nets        | 0.96 | 1.69 | 1.87 | 2.33   | ✓   | (✓) | ✗
Fourier neural networks            | 1.10 | 1.15 | 1.21 | 1.21   | ✓   | ✗   | ✗

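transformation

$$
E(\boldsymbol{\theta}, \boldsymbol{\omega}) = \sum_{k} \left\| \mathbf{x}_k - f_{\theta}\!\left( \begin{bmatrix} \sin(\omega_1 t_k) \\ \vdots \\ \sin(\omega_N t_k) \\ \cos(\omega_1 t_k) \\ \vdots \\ \cos(\omega_N t_k) \end{bmatrix} \right) \right\|^2 = \sum_{k} \left( \mathbf{x}_k - f_{\theta}(\Omega(\boldsymbol{\omega} t_k)) \right)^2, \tag{14.12}
$$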
where f_θ defines the neural network that is learned to best represent the
signal with N learned frequencies. This problem is nonlinear and non-convex,
yet global optima can be computed [428]. Indeed, the loss function is periodic
in nature, and this is exploited in the training process as well.

The Koopman forecasting method that is enabled by training (14.12) provides a
mid- to long-term forecasting algorithm that has superior performance.
Table 14.1 compares a number of techniques on power grid data, from leading
statistical time-series methods to state-of-the-art machine learning algorithms
for time series, against the Koopman forecasting tool (14.12) along with its
Fourier forecasting counterpart (14.11). Both methods provide superior
performance, with Koopman forecasting producing improvements that are
significant.

A number of further examples are given in Fig. 14.4. Specifically, it shows
the performance of the Fourier-based algorithm graphically, predicting the last
time frame for a Kolmogorov 2D flow, flow around a cylinder, flow over a
cavity, and a video of a fan. The data for the Kolmogorov 2D flow was taken
from the experiments conducted in [705]. Note that the Kolmogorov 2D flow
and the data for the video frame prediction task constitute real measurements
and therefore exhibit a considerable amount of noise, whereas the cylinder and
cavity flow data are from simulation. The codes for the Fourier and Koopman
forecasting algorithms are available at
https://github.com/helange23/from_fourier_to_koopman


14.4 Learning Nonlinear Operators 


The universal approximation capabilities of neural networks are well known.
Specifically, neural networks can generically approximate any continuous
function. More recently, Lu et al. [461] (DeepONet) and Kovachki et al. [407]
(neural operator) have highlighted results by Chen and Chen [174] proving that
neural networks with a single hidden layer can accurately approximate any
nonlinear continuous operator. Thus a nonlinear operator is learned that maps
functions to functions. In practice, this is perhaps an even more impactful
theory, as the operator is often the more important quantity to compute, since
it contains information about the physics and dynamics of the system. Note
that this approach is fundamentally different from what was considered in the
previous section with Koopman theory. Koopman theory attempts to approximate
the dynamics with a linear operator, while Lu et al. and Kovachki et al.
directly construct a nonlinear operator using neural networks.

The original work of Chen and Chen [174] constructs a universal approximation
proof of an operator that DeepONet constructs through training. The
theorem is the following:


Theorem (Universal approximation theorem for operators). Suppose that σ is
a continuous non-polynomial function, X is a Banach space, $K_1 \subset X$ and
$K_2 \subset \mathbb{R}^d$ are two compact sets in X and $\mathbb{R}^d$,
respectively, V is a compact set in $C(K_1)$, and G is a nonlinear continuous
operator, which maps V into $C(K_2)$. Then, for any ε > 0, there are positive
integers n, p, m, constants $c_i^k, \xi_{ij}^k, \theta_i^k, \zeta_k \in \mathbb{R}$,
$w_k \in \mathbb{R}^d$, and $x_j \in K_1$, where i = 1, 2, ..., n,
k = 1, 2, ..., p, and j = 1, 2, ..., m, such that

$$
\left| G(u)(y) - \sum_{k=1}^{p} \sum_{i=1}^{n} c_i^k \, \sigma\!\left( \sum_{j=1}^{m} \xi_{ij}^k u(x_j) + \theta_i^k \right) \sigma(w_k \cdot y + \zeta_k) \right| < \epsilon \tag{14.13}
$$

holds for all u ∈ V and $y \in K_2$.
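
One way to read (14.13), offered here as an interpretation rather than as part of the theorem, is to group the double sum by the index k. The labels $b_k$ and $t_k$ below are introduced only for illustration: the first factor depends on the input function u through its values at the m sensor points, and the second depends only on the evaluation location y.

```latex
G(u)(y) \approx \sum_{k=1}^{p}
\underbrace{\sum_{i=1}^{n} c_i^k \,
  \sigma\!\Big( \sum_{j=1}^{m} \xi_{ij}^k \, u(x_j) + \theta_i^k \Big)}_{b_k(u)}
\;
\underbrace{\sigma(w_k \cdot y + \zeta_k)}_{t_k(y)}
```

In this reading, the approximation is an inner product of two p-dimensional feature vectors, one computed from u and one computed from y.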
Figure 14.4: The last frame as predicted by PCA in conjunction with the
Fourier-based algorithm for fluid flows and video frame prediction. For a video
that shows the performance, visit https://www.youtube.com/watch?v=
From Lange et al. [428].


The theorem provides theoretical bounds on the ability of a neural network to
approximate the operator G(·). The theorem also highlights the construction of
two neural networks, so that it can be more compactly represented as

$$
\left| G(\mathbf{u})(\mathbf{y}) - f_{\theta_1}(\mathbf{u}) \cdot f_{\theta_2}(\mathbf{y}) \right| < \epsilon \tag{14.14}
$$

when considering the discretized representation of $u(x) \to \mathbf{u}$ and new
measurement (function evaluation) locations $y \to \mathbf{y}$. Figure 14.5 highlights the neural
Figure 14.5: Architecture for learning nonlinear operators. The DeepONet
trains two networks: (i) a branch network $f_{\theta_1}$ that maps the original
field variable u evaluated at m measurement points $x_k$ (where $u_k = u(x_k)$)
to a latent representation $H_1$; and (ii) a trunk network $f_{\theta_2}$ that
maps arbitrary and new spatial locations y to a latent representation $H_2$.
The operator is then given by the expression from Chen and Chen as
$G = f_{\theta_1} \cdot f_{\theta_2}$.


network trained by Lu et al. [461] that leverages the universal operator
approximation theorem of Chen and Chen [174]. The two simultaneously trained
networks are called the branch network $f_{\theta_1}(\mathbf{u})$ and the trunk
network $f_{\theta_2}(\mathbf{y})$. Mathematically, the concept is quite simple.
Given a number of measurement (sensor) locations $x_k$ (usually selected from a
computational grid), which prescribe the input function values $u_k = u(x_k)$,
a vector of training input data u can be constructed. The input data has
corresponding output data G(u). In addition, training data mapping selections
of random measurement points y to the output G(u)(y) is required. Thus the
input functions u are encoded in a separate network from the location
variables y. These are merged at the end, as shown in the universal
approximation proof of Chen and Chen [174]. Figure 14.6 shows the results of
training from the original DeepONet paper of Lu et al. [461] on a
reaction–diffusion system. DeepONet can also achieve small generalization
errors by employing inductive biases. Remarkably, exponential convergence is
observed in the deep learning algorithm. The code is
[Figure 14.6, panel A labels: a random u sample; s(x, t) at P random locations;
training data from one u, namely (u, (x_1, t_1), s(x_1, t_1)),
(u, (x_2, t_2), s(x_2, t_2)), ..., (u, (x_P, t_P), s(x_P, t_P)); repeat for
different u.]

Figure 14.6: Learning a reaction–diffusion system with DeepONet. (A) An
example of a random sample of the input function u(x) (left). The corresponding
output function s(x,t) at P different (x,t) locations (middle). Pairing of inputs 
and outputs at the training data points (right). The total number of training 
data points is the product of P times the number of samples of u. (B) Training 
error (blue) and test error (red) for different values of the number of random 
points P when 100 random u samples are used. (C) Training error (blue) and 
test error (red) for different number of u samples when P = 100. The shaded 
regions denote one standard deviation. From Lu et al. [461]. 


available at https://github.com/lululxvi/deepxde
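
The branch/trunk split can be sketched as a forward pass only, under loud assumptions: the weights below are random stand-ins for trained parameters, the layer sizes are arbitrary, and none of the helper names come from the DeepXDE library. The sketch shows how one vector of sensor values u and one location y are encoded separately and merged by an inner product.

```python
import math
import random

def random_mlp(sizes, rng):
    """Random weights for a fully connected network (stand-in for trained parameters)."""
    return [
        ([[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
         [rng.uniform(-1.0, 1.0) for _ in range(n_out)])
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp(layers, x):
    """tanh hidden layers followed by a linear output layer."""
    for depth, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
        if depth < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

def deeponet(branch, trunk, u_sensors, y):
    """G(u)(y) ~ <branch(u), trunk(y)>: the two latent vectors are merged by a dot product."""
    h_branch = mlp(branch, u_sensors)  # encodes the input function from its sensor values
    h_trunk = mlp(trunk, y)            # encodes the evaluation location
    return sum(b * t for b, t in zip(h_branch, h_trunk))

rng = random.Random(0)
m, p = 8, 5                            # m sensors, p latent features (illustrative sizes)
branch = random_mlp([m, 16, p], rng)
trunk = random_mlp([1, 16, p], rng)

sensors = [k / (m - 1) for k in range(m)]
u = [math.sin(2.0 * math.pi * xk) for xk in sensors]

# The same encoded u can be queried at arbitrary locations y, on or off the sensor grid.
values = [deeponet(branch, trunk, u, [y]) for y in (0.1, 0.37, 0.9)]
print(values)
```

Training would adjust both sets of weights so that the dot product matches G(u)(y) on the paired training data; that step is omitted here.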


Neural operators are a closely related method for producing mappings be- 
tween function spaces, thus allowing for the approximation of operators that 
encode governing equations and physics [407, 441, 442, 443]. Neural operators
generalize standard feedforward neural networks to learn mappings between
infinite-dimensional spaces of functions defined on bounded domains of $\mathbb{R}^d$.
The non-local component of the architecture is instantiated either through a pa- 
rameterized integral operator or through multiplication in the spectral domain, 
which is a specific form of the kernel in the integral operator. As with Deep- 
ONet, neural operators, once trained, have the property of being discretization- 
invariant: sharing the same network parameters between different discretiza- 
tions of the underlying functional data. This is their specific advantage: neural 
operators and DeepONets are mesh-free methods once trained. 

Neural operators have a different structure from DeepONet. Specifically,
they leverage an integral kernel representation in their approximation of the
operator. For instance, neural operators can make explicit use of multi-pole
and Fourier kernels in order to construct operator representations.
Thus nonlocal representations of the solution are parameterized by the
integral operator. Recall that learning a nonlinear operator G(·) is equivalent
to learning the inverse of the PDE evolution $u_t = N(u)$. Thus kernel operators
are intuitively appealing for the construction of the nonlinear operator. The
overall representation of the operator is a trained neural network

$$
G = f_{\theta}, \tag{14.15}
$$

where individual layers of the neural network are constructed from learned
integral representations that are updated according to the following:

$$
u_{k+1}(x) = \sigma_{k+1}\!\left( W_k u_k(x) + \int_{D_k} K^{(k)}(x, y)\, u_k(y)\, d\nu_k(y) + b_k(x) \right), \tag{14.16}
$$

where $\nu_k$ is a Lebesgue measure on $\mathbb{R}^{d_k}$. The kernel
$K^{(k)}(x, y)$ is typically chosen to leverage advantageous representations,
such as the multi-pole or Fourier kernels. Thus each layer of the network is
trained using a physics-inspired concept of an integral (inverse)
representation of the PDE dynamics. The kernel representation is strongly
motivated by the concept of the Green’s function, which provides a fundamental
solution for linear PDEs by expressing the solution as an integration over the
Green’s function kernel.
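
A single layer of the form (14.16) can be sketched with a quadrature rule, assuming a one-dimensional domain [0, 1], scalar weights, and a fixed Gaussian kernel as a hypothetical stand-in for a learned kernel. Because the kernel is a function of (x, y) rather than a matrix tied to one grid, the same parameters can be applied at two resolutions, which is the discretization invariance described above.

```python
import math

def kernel(x, y, length=0.2):
    """Hypothetical smooth kernel standing in for a learned K(x, y)."""
    return math.exp(-((x - y) / length) ** 2)

def integral_layer(u, w=0.5, b=0.1):
    """One layer v(x) = tanh(w u(x) + int_0^1 K(x, y) u(y) dy + b) on an n-point midpoint grid."""
    n = len(u)
    grid = [(i + 0.5) / n for i in range(n)]
    out = []
    for x, ux in zip(grid, u):
        # Midpoint-rule quadrature of the kernel integral; the weight 1/n is the cell width.
        integral = sum(kernel(x, y) * uy for y, uy in zip(grid, u)) / n
        out.append(math.tanh(w * ux + integral + b))
    return out

def sample(n):
    """Sample the input function u(x) = sin(pi x) on the n-point midpoint grid."""
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]

coarse = integral_layer(sample(15))   # grid contains x = 0.5 at index 7
fine = integral_layer(sample(45))     # grid contains x = 0.5 at index 22
gap = abs(coarse[7] - fine[22])
print(coarse[7], fine[22], gap)
```

The two outputs at x = 0.5 agree up to quadrature error even though the grids differ by a factor of three. In the Fourier variant the same integral is evaluated by multiplication in the spectral domain instead of direct quadrature.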

Although both neural operators and DeepONets accomplish the same goal,
they do so with different architectures. Neural operators exploit the kernel
structure of generic operators, while DeepONets train by separating the input
function from the spatial locations. Both have achieved promising results,
highlighting the fact that the learning of operators can potentially allow for
mesh-free models of physics systems. Of course, in order for this to be viable
in practice, exceptionally large training data sets that resolve all scales are
required for training. Figure 14.7 highlights the results from Kovachki et
al. [407], where neural operators are used to model fluid flows. The codes are
available at https://github.com/zongyi-li/graph-pde and
https://github.com/zongyi-li/fourier_neural_operator

14.5 Physics-Informed Neural Networks (PINNs) 


An elegant solution technique for solving many physics-based problems is the 
physics-informed neural network (PINN) pioneered by Raissi, Perdikaris, and
Karniadakis [584]. The method is simple in concept: find a solution to a PDE 
by enforcing satisfaction of the PDE in the neural network loss function. The 
[Figure 14.7 panel labels: initial vorticity; t = 15; prediction.]


Figure 14.7: Zero-shot super-resolution. The vorticity field of the solution to
the two-dimensional Navier–Stokes equation with viscosity $10^{-4}$ (Re = O(200)).
The ground truth is on the top and prediction on the bottom. The model is
trained on data that is discretized on a uniform 64 × 64 spatial grid and a
20-point uniform temporal grid. The model is evaluated with a different initial
condition that is discretized on a uniform 256 × 256 spatial grid and an 80-point
uniform temporal grid. From Kovachki et al. [407].


method can be used both for approximating the solution of a PDE and also for 
system identification, much like the SINDy algorithm. 

To be more mathematically precise, we again consider generically a system
of nonlinear PDEs of a single spatial variable that can be modeled as

$$
u_t = N(u, u_x, u_{xx}, \ldots, x, t; \beta) + g, \tag{14.17}
$$

where the subscripts denote partial differentiation, g is a forcing, and N(·)
prescribes the generically nonlinear evolution. The parameter β will represent
a bifurcation parameter for our later considerations. Further, associated with
(14.17) are a set of initial and boundary conditions on a domain x ∈ [−L, L].
PINNs define the function

$$
f := u_t - N(u, u_x, u_{xx}, \ldots, x, t; \beta). \tag{14.18}
$$


The original formulation by Raissi et al. [584] was generalized to the
representation of the PDE as f(u, θ). Figure 14.8 highlights the basic
architecture. The goal is to determine an approximation û to the
spatio-temporal data u. Not only should the approximation satisfy the PDE, it
should also fit the actual data and its boundary and initial conditions. To be
more precise, there is a neural network trained to map the spatio-temporal
location to the data. The loss function for this is given by

$$
\mathcal{L}_u = \sum_{k=1}^{N} \| u - \hat{u} \|. \tag{14.19}
$$
Figure 14.8: Schematic of a PINN architecture, where the loss function of the
trained neural network contains a mismatch in the given data on the state
variables and/or boundary and initial conditions. In addition, the loss
function penalizes violations of the PDE evolution. From Meng et al. [496].


Note that by choosing data points at x = ±L and/or t = 0, the boundary
and initial conditions are enforced, respectively. In addition, the
approximation should satisfy the PDE, which is enforced through the loss

$$
\mathcal{L}_f = \sum_{k=1}^{N} \| f - g \|. \tag{14.20}
$$

The neural network is then trained to minimize the loss functions. In summary:
one trains the network to find a solution û that best matches the data and
satisfies the PDE.
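
The structure of the combined loss can be sketched on a toy problem. Everything specific here is an assumption made for illustration and is not from Raissi et al.: the PDE is the heat equation u_t = ν u_xx on x ∈ [0, 1] (so g = 0), the neural network is replaced by a two-parameter ansatz, automatic differentiation is replaced by central differences, and the optimizer is plain gradient descent with a numerical gradient.

```python
import math

NU = 0.1  # diffusivity of the assumed test problem u_t = NU * u_xx

def u_hat(x, t, theta):
    """Two-parameter stand-in for the network: amplitude theta[0], decay rate theta[1]."""
    return theta[0] * math.exp(-theta[1] * t) * math.sin(math.pi * x)

def pde_residual(x, t, theta, h=1e-3):
    """f = u_t - NU * u_xx, with central differences standing in for autodiff."""
    u_t = (u_hat(x, t + h, theta) - u_hat(x, t - h, theta)) / (2.0 * h)
    u_xx = (u_hat(x + h, t, theta) - 2.0 * u_hat(x, t, theta) + u_hat(x - h, t, theta)) / h ** 2
    return u_t - NU * u_xx

# Data term: samples (x, t, u) of the initial condition u(x, 0) = sin(pi x).
DATA = [(k / 10.0, 0.0, math.sin(math.pi * k / 10.0)) for k in range(11)]
# Physics term: interior collocation points where only the PDE residual is evaluated.
COLLOCATION = [((i + 0.5) / 8.0, (j + 0.5) / 8.0) for i in range(8) for j in range(8)]

def loss(theta):
    data_term = sum((u - u_hat(x, t, theta)) ** 2 for x, t, u in DATA)
    pde_term = sum(pde_residual(x, t, theta) ** 2 for x, t in COLLOCATION)
    return data_term + pde_term

def grad(theta, eps=1e-5):
    """Central-difference gradient of the loss with respect to the parameters."""
    g = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        g.append((loss(up) - loss(dn)) / (2.0 * eps))
    return g

theta = [0.5, 0.5]
for _ in range(500):
    theta = [p - 0.02 * g for p, g in zip(theta, grad(theta))]
print(theta, NU * math.pi ** 2)
```

For this problem the exact solution is exp(−ν π² t) sin(π x), so the data term pins the amplitude near 1 and the residual term pins the decay rate near ν π². No solution data at t > 0 is supplied; the PDE residual alone determines the time dependence.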

The PINN architecture was also used to perform system identification tasks, 
i.e., discover the underlying governing equations [584]. In this case, the PDE is 
formulated in much the same way as the SINDy algorithm, where now 


$$
f := u_t - \Theta(u)\,\xi, \tag{14.21}
$$

and the coefficients ξ are also determined in the regression process. Figure 14.8
shows the potential library terms in green, which can be used to construct the
governing equations. The loadings ξ dictate which terms contribute. In the
original work, the library of dynamic terms Θ was quite limited, unlike SINDy,
which builds a large library of potential terms.
Figure 14.9: Burgers’ equation. (top) Predicted solution u(x,t) along with the
initial and boundary training data. In addition, 10000 collocation points were
used as data generated using a Latin hypercube sampling strategy. (bottom)
Comparison of the predicted and exact solutions corresponding to the three
temporal snapshots depicted by the white vertical lines in the top panel. From
Raissi et al. [584].


Two figures illustrate the power of PINNs. In the first (Fig. 14.9), training
data is used to build a representation of Burgers’ equation. The PINN model
converges to an accurate representation of the PDE dynamics and can then
serve as a proxy to computational data. In fact, one simply needs to specify
time and space to produce a value of the field u. In the second example
(Fig. 14.10), the Korteweg–de Vries (KdV) equation is analyzed with the PINN.
The PINN not only produces an accurate neural network proxy for the
spatio-temporal field, but also further identifies the PDE dynamics by
identifying the parameters in front of the appropriate terms. The success,
modularity, and simplicity of PINNs have led to significant advancements and
extensions of the method, many of which are reviewed by Karniadakis et
al. [375]. Krishnapriyan et al. have also recently modified PINN architectures
with improved regularization to help make them more amenable to complex
systems.


Copyright © 2021 Brunton & Kutz, Cambridge University Press. All Rights Reserved. 




[Figure 14.10 panels. Top: u(t, x) for 0 ≤ t ≤ 1. Middle: snapshots at
t = 0.20 (199 training data) and t = 0.80 (201 training data) for −1 ≤ x ≤ 1,
showing the exact solution and the data.]

Correct PDE:                  u_t + u u_x + 0.0025 u_xxx = 0
Identified PDE (clean data):  u_t + 1.000 u u_x + 0.0025002 u_xxx = 0
Identified PDE (1% noise):    u_t + 0.999 u u_x + 0.0024996 u_xxx = 0


Figure 14.10: KdV equation. (top) Solution u(x,t) along with the temporal
locations of the two training snapshots. (middle) Training data and exact
solution corresponding to the two temporal snapshots depicted by the dashed
vertical lines in the top panel. (bottom) Correct partial differential equation
along with the identified one obtained by learning Θ. From Raissi et al. [584].


14.6 Learning Coarse-Graining for PDEs 


The modeling of multi-scale physics remains particularly challenging because
numerical algorithms must resolve spatial and temporal scales that can vary
across many orders of magnitude. Even if one is interested only in macro-scale
phenomena, accurate models are produced only by resolving the fastest
timescales and finest spatial resolutions. This generates significant
computational expense. Two common methods have been used to circumvent this
computational expense: multi-grid methods and coarse graining. Multi-grid
methods [494, 721], for instance, have been extensively developed for
physics-based simulation models, where coarse-grained models must be
progressively refined in order to achieve a required numerical precision while
keeping the simulation tractable. Multi-grid architectures provide a principled
method for targeting the refinement process, constituting a mature field with
widespread applications in the engineering and physical sciences. In contrast,
coarse-graining methods attempt to construct a macro-scale physics model by
progressive construction of coarse-grained variables and their dynamics.
Mathematical algorithms such as heterogeneous multi-scale modeling (HMM) and
the equation-free method provide principled methods for multi-scale systems.
Additional work has focused on testing for the presence of multi-scale
dynamics, so that analyzing and simulating multi-scale systems is more
computationally efficient [254, 255].

Deep learning and neural networks provide an alternative to these multi- 
scale modeling efforts. In this case, the goal is to train a neural network to 
coarse-grain a model directly from data. Featured here is work by Bar-Sinai 
et al. [52], who, instead of deriving an approximate coarse-grained continuum 
model and discretizing it, suggest directly learning low-resolution discrete mod- 
els that encapsulate unresolved physics. Consider the governing PDE (14.17). 
Numerical discretization immediately turns the continuous PDE into an n- 
dimensional system of coupled differential equations. This is best illustrated 
by finite-difference discretization. Finite-difference methods generate a
solution vector $\mathbf{u} = [u_1 \; u_2 \; \cdots \; u_n]^T$, where
$u_k = u(x_k)$. Derivatives are then computed by using differences in $u_k$.
For instance, the first derivative is given by

$$
\frac{\partial u_k}{\partial x} \approx \frac{u_{k+1} - u_{k-1}}{2 \Delta x}, \tag{14.22}
$$

where $\Delta x = x_{k+1} - x_k$. Thus the evolution of the field at $u_k$
depends on the neighbors $u_{k \pm 1}$. This creates a global coupling between
all discretization points. Of course, there are alternatives to
differentiation using finite differences, including polynomial expansions and
spectral methods. But each, in turn, generates coupling between n differential
equations. For instance, spectral methods generate coupling in global Fourier
modes [420].

Ultimately, differentiation schemes result in a general expression of the form

$$
\frac{\partial^p u}{\partial x^p} \approx \sum_{k=1}^{n} \alpha_k^{(p)} u_k, \tag{14.23}
$$

where $\alpha_k^{(p)}$ are pre-computed coefficients from a prescribed
differentiation scheme. Importantly, error estimates are directly related to
the spatial discretization Δx. The discretization must be small enough to
resolve all spatial scales, and thus it sets the resolution limit and
computational time required to solve the PDE. Similarly, discretization of time
generates a corresponding Δt, which is related to Δx through a
Courant–Friedrichs–Lewy (CFL) condition for the stability of
a numerical scheme [420].
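
The dependence of the error on the grid spacing can be checked directly for the central difference (14.22). The sketch below assumes a periodic grid on [0, 2π) and the test function sin(x); the function names are illustrative.

```python
import math

def central_difference(u, dx):
    """du/dx on a periodic grid, i.e. (14.22) with stencil weights (-1/(2 dx), 0, 1/(2 dx))."""
    n = len(u)
    # u[k - 1] wraps around for k = 0, and (k + 1) % n wraps around at the right end.
    return [(u[(k + 1) % n] - u[k - 1]) / (2.0 * dx) for k in range(n)]

def max_error(n):
    """Largest pointwise error when differentiating sin(x) on an n-point grid."""
    dx = 2.0 * math.pi / n
    x = [k * dx for k in range(n)]
    du = central_difference([math.sin(xk) for xk in x], dx)
    return max(abs(d - math.cos(xk)) for d, xk in zip(du, x))

errors = {n: max_error(n) for n in (16, 32, 64)}
print(errors)
```

Halving the grid spacing reduces the error by roughly a factor of four, the second-order accuracy expected of this stencil. This is why resolving fine scales with fixed, pre-computed coefficients drives up the number of grid points and hence the cost.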
Figure 14.11: Neural network architecture. During training, the model is
optimized to predict cell-average time derivatives or time-evolved solution
values from cell-average values, based on a pre-computed data set of snapshots
from high-resolution simulations. During inference, the optimized model is
repeatedly applied to predict the time evolution using the method of lines. From
Bar-Sinai et al. [52].


Standard schemes use one set of pre-computed coefficients for all points
in space, while more sophisticated methods alternate between different sets of
coefficients according to local rules. Regardless, the governing PDE (14.17) is
now the high-dimensional system of ODEs

$$
\frac{d u_k}{dt} = N(u, u_1, u_2, \ldots, u_n, x, t; \beta). \tag{14.24}
$$

One criticism of this discretized model is the computational cost of simulating 
it if there are significant time and space scales that need to be resolved. 
Bar-Sinai et al. [52] learn directly from data a flexible parameterization of
the derivative (14.23). The training data in this case are highly resolved
simulations of the multi-scale dynamics. From such simulations, a model can be
learned for the coefficients $\alpha_k^{(p)}$. Indeed, the coefficients
$\alpha_k^{(p)}$ are learned (chosen) in order to model the coarse-grained
dynamics most accurately. Figure 14.11 shows the architecture used to train
such a model. The philosophy behind this parameterization can be carried over
to time integration as well. Importantly, the integrator must be numerically
stable and also generalize as well as possible. This is achieved in Bar-Sinai
et al. [52] by leveraging a multi-layer neural network to parameterize the
solution manifold. The multi-layer network’s flexibility also allows one to
impose physical constraints and interpretability through choice of model
architecture. Details of the neural network and its implementation can be
found in Bar-Sinai et al. [52].

Figure 14.12 compares the integration results for a particular realization of
the forcing of Burgers’ equation for different values of the resampling factor,
that is, the ratio between the number of grid points in the low-resolution
calculation and that of the fully converged solution. The learned models, with
both constant and solution-dependent coefficients, can propagate the solution
in time and dramatically outperform the baseline method at low resolution.
Importantly, the ringing effect around the shocks, which leads to numerical
instabilities, is practically eliminated. Since the model is trained on fully
resolved simulations, a crucial requirement for the method to be of practical
use is that training can be done on small systems, but still produce models
that perform well on larger ones. This is expected to be the case, since the
models, being based on convolutional neural networks, use only local features
and by construction are translation-invariant. Figure 14.12 illustrates the
performance of the model trained on the domain [0, 2π] for predictions on a
10-times larger spatial domain of size [0, 20π]. The learned model generalizes
well. For example, it shows good performance when function values are all
positive in a region of size greater than 2π, which, due to the conservation
law, cannot occur in the training data set. Overall, the work of Bar-Sinai et
al. [52] provides an elegant closure solution for the parameterization of
fine-scale physics, something that is of high value for producing reasonable
solution times for multi-scale physics systems.
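
The idea of choosing stencil coefficients from data, rather than from a Taylor expansion, can be illustrated in a heavily simplified form. The sketch below is not the method of Bar-Sinai et al.: it assumes a single constant coefficient α in du/dx ≈ α (u_{k+1} − u_{k−1}), fitted by one-parameter least squares (no neural network), and a training family consisting only of sinusoids with one fixed wavenumber.

```python
import math
import random

def learn_alpha(samples):
    """Least-squares alpha in du/dx ~ alpha * (u[k+1] - u[k-1]) from (difference, derivative) pairs."""
    num = sum(d * y for d, y in samples)
    den = sum(d * d for d, _ in samples)
    return num / den

rng = random.Random(1)
wavenumber, dx = 3.0, 2.0 * math.pi / 16   # coarse grid: about 5 points per wavelength

samples = []
for _ in range(200):
    phase = rng.uniform(0.0, 2.0 * math.pi)
    x = rng.uniform(0.0, 2.0 * math.pi)
    difference = (math.sin(wavenumber * (x + dx) + phase)
                  - math.sin(wavenumber * (x - dx) + phase))
    derivative = wavenumber * math.cos(wavenumber * x + phase)  # plays the role of resolved data
    samples.append((difference, derivative))

alpha_learned = learn_alpha(samples)
alpha_standard = 1.0 / (2.0 * dx)          # Taylor-series coefficient from (14.22)

def rms_error(alpha):
    return math.sqrt(sum((alpha * d - y) ** 2 for d, y in samples) / len(samples))

print(alpha_standard, alpha_learned)
print(rms_error(alpha_standard), rms_error(alpha_learned))
```

For this data family the fitted coefficient equals k / (2 sin(k Δx)), which differentiates the training signals exactly on the coarse grid, whereas the standard coefficient underestimates the derivative. The learned coefficient is specialized to the data it was fitted on; the solution-dependent coefficients produced by a neural network are one way to relax that limitation.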

An alternative deep learning approach uses a multi-resolution convolutional 
autoencoder (MrCAE) architecture that integrates and leverages three highly 
successful mathematical architectures: (i) multi-grid methods, (ii) convolutional 
autoencoders, and (iii) transfer learning. The method provides an adaptive, 
hierarchical architecture that capitalizes on a progressive training approach 
for multi-scale spatio-temporal data. This framework allows for inputs across 
multiple scales: starting from a compact (small number of weights) network 
architecture and low-resolution data, this network progressively deepens and 
widens itself in a principled manner to encode new information in the higher- 
resolution data based on its current performance of reconstruction. Basic trans- 
fer learning techniques are applied to ensure information learned from previ- 
ous training steps can be rapidly transferred to the larger network. As a re- 
sult, the network can dynamically capture different scaled features at different 


depths of the network. For details, see https://github.com/luckystarufo/MrCAE
Figure 14.12: Time integration results for Burgers’ equation. (A) A particular
realization of a solution at varying resolution solved by the baseline
first-order finite-volume method, weighted essentially non-oscillatory (WENO),
optimized constant coefficients with Godunov flux (Opt. God.), and the neural
network (NN), with the white region indicating times when the solution
diverged. Both learned methods manifestly outperform the baseline method and
even outperform WENO at coarse resolutions. (B) Inference predictions for the
32× neural network model, on a 10 times larger spatial domain (only partially
shown). The box surrounded by the dashed line shows the spatial domain
used for training. (C) Mean absolute error between integrated solutions and
the ground truth, averaged over space, times less than 15, and 10 forcing
realizations on the 10-times larger inference domain. These metrics almost
exactly match results on the smaller training domain [0, 2π]. As ground truth,
we use WENO simulations on a 1× grid. Markers are omitted if some simulations
diverged or if the average error is worse than fixing u = 0. From Bar-Sinai et
al. [52].
14.7 Deep Learning and Boundary Value Problems 


Thus far, boundary value problems (BVPs) have not been discussed. How- 
ever, the differential and partial differential equations that represent BVPs are 
amenable to all the data-driven methods discussed thus far. As an example of 
how deep learning can be used in the context of BVPs, we consider the ubiq- 
uitous Green’s function. The Green’s function constructs the solution to a BVP 
for any given forcing by linear superposition. Specifically, consider the classical 
linear BVP 

$$
L[v(x)] = f(x), \tag{14.25}
$$

where L is a linear differential operator, f is a forcing, x ∈ Ω is the spatial
coordinate, and Ω is an open set. The boundary conditions B[v(x)] = 0 are
imposed on ∂Ω with a linear operator B. The fundamental solution is constructed
by considering the adjoint equation

$$
L^{\dagger}[G(x, \xi)] = \delta(x - \xi), \tag{14.26}
$$

where $L^{\dagger}$ is the adjoint operator (along with its associated boundary
conditions) and δ(x − ξ) is the Dirac delta function. Taking the inner product
of (14.25) with respect to the Green’s function gives the fundamental solution

$$
v(x) = \langle f(\xi), G(\xi, x) \rangle = \int_{\Omega} G(\xi, x)\, f(\xi)\, d\xi, \tag{14.27}
$$


which is valid for any forcing f(x). Thus, once the Green’s function is com- 
puted, the solution for arbitrary forcing functions can be extracted from inte- 
gration. This integration represents a superposition of a continuum of delta 
function forcings that are used to represent f(x). The Green’s function was a 
motivating example in the neural operator approach since it 
provides a kernel representation of the solution. Of course, the Green’s function 
only works for linear problems, since superposition of solutions must hold. 
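
The superposition formula (14.27) can be checked numerically on a concrete case. The specific problem is an assumption chosen for illustration: −v'' = f on (0, 1) with v(0) = v(1) = 0, whose Green’s function is known in closed form and is symmetric, so that G(ξ, x) = G(x, ξ). The integral is evaluated with the midpoint rule.

```python
import math

def green(x, xi):
    """Green's function of -v'' = f on (0, 1) with v(0) = v(1) = 0."""
    return x * (1.0 - xi) if x <= xi else xi * (1.0 - x)

def solve(forcing, x, n=400):
    """v(x) = int_0^1 G(x, xi) f(xi) d xi, approximated by the midpoint rule."""
    return sum(green(x, (j + 0.5) / n) * forcing((j + 0.5) / n) for j in range(n)) / n

def f1(xi):
    return math.pi ** 2 * math.sin(math.pi * xi)   # exact solution: sin(pi x)

def f2(xi):
    return 1.0                                     # exact solution: x (1 - x) / 2

def f_combined(xi):
    return f1(xi) + 2.0 * f2(xi)

x0 = 0.3
v1, v2 = solve(f1, x0), solve(f2, x0)
v_combined = solve(f_combined, x0)
print(v1, math.sin(math.pi * x0))
print(v2, x0 * (1.0 - x0) / 2.0)
print(v_combined, v1 + 2.0 * v2)
```

The same kernel solves the problem for every forcing, and the response to a combined forcing is the same combination of the individual responses. That linearity is exactly the property that fails for a nonlinear operator.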
Neural networks can, however, transform our view of Green’s functions. 
Specifically, as already illustrated in previous sections, data-driven modeling 
can jointly learn coordinates and models. Thus, for BVPs, one can learn a coor- 
dinate transformation and kernel representation jointly, which would allow for 
the Green’s function methodology. Thus we can turn nonlinear problems lin- 
ear so as to exploit linear superposition. Indeed, in many modern applications, 
nonlinearity plays a fundamental role so that the BVP is of the form 


$$
N[u(x)] = F(x), \tag{14.28}
$$

where N[·] is a nonlinear differential operator. For this case, the principle of
linear superposition no longer holds and the notion of a fundamental solution 
Figure 14.13: DeepGreen architecture. Two autoencoders learn invertible
coordinate transformations that linearize a nonlinear boundary value problem. The
latent space is constrained to exhibit properties of a linear system, including 
linear superposition, which enables discovery of a Green’s function for nonlin- 
ear boundary value problems. From Gin et al. [280]. 


is lost. However, modern deep learning algorithms allow us the flexibility of 
learning coordinate transformations (and their inverses) of the form 


$$
v = \psi(u), \tag{14.29a}
$$
$$
f = \phi(F), \tag{14.29b}
$$

such that v and f satisfy the linear BVP (14.25) for which we generated the
fundamental solution (14.27). This gives a nonlinear fundamental solution
through use of this deep learning transformation.

DeepGreen [280] leverages the success of DNNs for dynamical systems to
discover coordinate transformations that linearize nonlinear BVPs so that the
Green’s function solution can be recovered. This allows for the discovery of
the fundamental solutions for nonlinear BVPs, opening many opportunities for
the engineering and physical sciences. DeepGreen exploits physics-informed
learning by using autoencoders (AEs) to take data from the original
high-dimensional input space to the new coordinates at the intrinsic rank of
the underlying physics [544]. The architecture also leverages the success of
deep residual networks (DRNs) [317], which enable our approach to efficiently
Figure 14.14: Summary of results for three one-dimensional models. The
models are provided with the Green’s function learned by DeepGreen. A summary
box plot shows the relative losses for all three model systems. The loss
functions are shown in Fig. 14.13 and are associated with the autoencoder,
linearity, and cross-mapping. From Gin et al. [280].


handle near-identity coordinate transformations [279]. Figure 14.13 highlights
the deep learning approach, which leverages a dual autoencoder architecture.
DeepGreen transforms a nonlinear BVP to a linear BVP, solves the linearized
BVP, and then inverse-transforms the linear solution to solve the nonlinear BVP.
Figure 14.14 highlights the nonlinear Green’s functions found for a number of
prototype nonlinear BVPs. The success of the algorithm again shows how multiple
neural networks can be simultaneously trained to produce high-quality
characterizations of physics-based problems.
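
The linear step of this pipeline can be illustrated with a minimal sketch. It does not reproduce the DeepGreen networks; it only shows how a Green's function solves a linear BVP once the problem has been linearized. The example problem $-v'' = f$ on $[0,1]$ with $v(0)=v(1)=0$, the grid size `n`, and the forcing are all assumptions chosen for illustration.

```python
import numpy as np

n = 99                           # number of interior grid points (assumed)
h = 1.0 / (n + 1)                # grid spacing on [0, 1]
x = np.linspace(h, 1.0 - h, n)   # interior nodes

# Second-order finite-difference matrix for L = -d^2/dx^2 with v(0) = v(1) = 0
L = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

# Discrete Green's function: inv(L) equals h * G(x_i, x_j)
G = np.linalg.inv(L) / h

# Analytic Green's function of this BVP: G(x, xi) = min(x, xi) * (1 - max(x, xi))
X, XI = np.meshgrid(x, x, indexing="ij")
G_exact = np.minimum(X, XI) * (1.0 - np.maximum(X, XI))

# Solve by superposition: v(x) = integral of G(x, xi) f(xi) dxi
f = np.pi**2 * np.sin(np.pi * x)   # forcing whose exact solution is sin(pi x)
v = (h * G) @ f

print(np.max(np.abs(G - G_exact)), np.max(np.abs(v - np.sin(np.pi * x))))
```

In a DeepGreen-style workflow, the learned maps of (14.29) would wrap around a solve of this kind: encode the nonlinear data, apply the linear Green's function, and decode.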
Suggested Reading and Homework
Exercise 14-1. Read and reproduce the results of Champion et al. [168]: 
https://github.com/kpchamp/SindyAutoencoders 


Exercise 14-2. Read and reproduce the results of Lange et al. [428]: 
https://github.com/helange23/from_fourier_to_koopman 


Exercise 14-3. Read and reproduce the results of Lu et al. [461]: 
https://github.com/lululxvi/deepxde 


Exercise 14-4. Read and reproduce the results of Kovachki et al. [407]: 
https://github.com/zongyi-li/graph-pde
https://github.com/zongyi-li/fourier_neural_operator


Exercise 14-5. Read and reproduce the results of Raissi et al. [584]:
https://github.com/maziarraissi/PINNs 


Exercise 14-6. Read and reproduce the results of Bar-Sinai et al. [52]: 
https://github.com/google/data-driven-discretization-1d
Actor-critic — A reinforcement learning algorithm that simultaneously learns a
policy function and a value function, with the goal of taking the best from both. 


Adjoint — For a finite-dimensional linear map (i.e., a matrix A), the adjoint 
A* is given by the complex conjugate transpose of the matrix. In the infinite- 
dimensional context, the adjoint A* of a linear operator A is defined so that 
$\langle Af, g\rangle = \langle f, A^*g\rangle$, where $\langle\cdot,\cdot\rangle$ is an inner product.
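
The finite-dimensional case can be checked numerically. The sketch below assumes the standard complex inner product that is linear in its first argument; the random matrix and vectors are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
f = rng.standard_normal(n) + 1j * rng.standard_normal(n)
g = rng.standard_normal(n) + 1j * rng.standard_normal(n)

A_star = A.conj().T              # adjoint = complex conjugate transpose

def inner(a, b):
    """Inner product <a, b> = sum(a * conj(b)); np.vdot conjugates its first argument."""
    return np.vdot(b, a)

lhs = inner(A @ f, g)            # <A f, g>
rhs = inner(f, A_star @ g)       # <f, A* g>
print(abs(lhs - rhs))
```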


Agent — In reinforcement learning (RL), an agent senses the state s of its envi- 
ronment and learns to take appropriate actions a to achieve an optimal future 
reward r. 


Akaike information criterion (AIC) — An estimator of the relative quality of 
statistical models for a given set of data. Given a collection of models for the 
data, AIC estimates the quality of each model, relative to each of the other mod- 
els. Thus, AIC provides a means for model selection. 


Autoencoder — Autoencoders are a class of machine learning models that are 
used to learn efficient latent codings of unlabeled data (unsupervised learning). 
Autoencoders learn efficient codings by performing nonlinear dimensionality 
reduction. Autoencoders are typically trained with both an encoding layer and 
a decoding layer so that one can map to the latent representation and back. 


Backpropagation (backprop) — A method used for computing the gradients
required for gradient descent training of neural networks (NNs). Based upon the chain
rule, backprop exploits the compositional nature of NNs in order to frame an
optimization problem for updating the weights of the network. It is commonly
used to train deep neural networks (DNNs).


Balanced input-output model — A model expressed in a coordinate system 
where the states are ordered hierarchically in terms of their joint controllability 
and observability. The controllability and observability Gramians are equal and 
diagonal for such a system. 
Bayesian information criterion (BIC) — An estimator of the relative quality of
statistical models for a given set of data. Given a collection of models for the 
data, BIC estimates the quality of each model, relative to each of the other mod- 
els. Thus, BIC provides a means for model selection. 


Bellman optimality — A cornerstone of dynamic programming, stating that an 
optimal multi-step sequence must also be locally optimal in every sub-sequence 
of steps. 


Classification — A general process related to categorization, the process in which 
ideas and objects are recognized, differentiated, and understood. Classification 
is a common task for machine learning algorithms.


Closed-loop control — A control architecture where the actuation is informed 
by sensor data about the output of the system. 


Clustering — A task of grouping a set of objects in such a way that objects in 
the same group (called a cluster) are more similar (in some sense) to each other 
than to those in other groups (clusters). It is a primary goal of exploratory data 
mining, and a common technique for statistical data analysis. 


Coherent structure — A spatial mode that is correlated with the data from a 
system. 


Compressed sensing — The process of reconstructing a high-dimensional vec- 
tor signal from a random undersampling of the data using the fact that the 
high-dimensional signal is sparse in a known transform basis, such as the Fourier 
basis. 


Compression — The process of reducing the size of a high-dimensional vec- 
tor or array by approximating it as a sparse vector in a transformed basis. For 
example, MP3 and JPG compression use the Fourier basis or wavelet basis to 
compress audio or image signals. 


Control theory — The framework for modifying a dynamical system to conform 
to desired engineering specification through sensing and actuation. 


Controllability — A system is controllable if it is possible to steer the system to
any state with actuation. Degrees of controllability are determined by the
controllability Gramian.


Convex optimization — An algorithmic framework for minimizing convex functions
over convex sets.


Convolutional neural network (CNN) — A class of deep, feedforward neural 
networks that is especially amenable to analyzing natural images. The convo- 
lution is typically a spatial filter which synthesizes local (neighboring) spatial 
information. 


Cross-validation — A model validation technique for assessing how the results 
of a statistical analysis will generalize to an independent (withheld) data set. 


Data matrix — A matrix where each column vector is a snapshot of the state of 
a system at a particular instant in time. These snapshots may be sequential in 
time, or they may come from an ensemble of initial conditions or experiments. 


Deep learning — A class of machine learning algorithms that typically uses 
deep convolutional neural networks (CNNs) for feature extraction and trans- 
formation. Deep learning can leverage supervised (e.g., classification) and/or 
unsupervised (e.g., pattern analysis) algorithms, learning multiple levels of 
representations that correspond to different levels of abstraction; the levels form 
a hierarchy of concepts. 


Deep reinforcement learning — Reinforcement learning algorithms that lever- 
age deep neural networks, such as deep policy networks and deep Q-learning. 


DeepONet — DeepONets are a class of machine learning models for sequential 
data typically generated by a deterministic dynamical system. Specifically, a 
DeepONet learns the underlying operator associated with the dynamical sys- 
tem or PDE. DeepONets train two neural networks simultaneously: one for en- 
coding the input function at a fixed number of sensors/measurement locations 
(branch net), and another for encoding the locations for the output function 
(trunk net). 


DMD amplitude — The amplitude of a given DMD mode (see Dynamic mode 
decomposition) as expressed in the data. These amplitudes may be interpreted 
as the significance of a given DMD mode, similar to the power spectrum in the 
fast Fourier transform (FFT). 


DMD eigenvalue — Eigenvalues of the best-fit DMD operator A (see Dynamic
mode decomposition) representing an oscillation frequency and a growth or
decay term.


DMD mode (also dynamic mode) — An eigenvector of the best-fit DMD operator
A (see Dynamic mode decomposition). These modes are spatially coherent and
oscillate in time at a fixed frequency and a growth or decay rate.


Dynamic mode decomposition (DMD) — The leading eigendecomposition of
a best-fit linear operator $A = X'X^\dagger$ that propagates the data matrix $X$ into a
future data matrix $X'$. The eigenvectors of $A$ are DMD modes and the
corresponding eigenvalues determine the time dynamics of these modes.
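
A minimal sketch of this computation follows, using the common SVD-based route to the leading eigendecomposition without forming the full operator. The synthetic data (a damped rotation observed through a random measurement map `C`) and the truncation rank `r` are assumptions chosen so that the true eigenvalues are known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a 2D damped rotation lifted to 10 measurements (assumed setup)
theta = 0.3
A_true = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((10, 2))          # hypothetical measurement map
z = np.zeros((2, 51))
z[:, 0] = [1.0, 0.0]
for k in range(50):
    z[:, k + 1] = A_true @ z[:, k]
data = C @ z

X, Xp = data[:, :-1], data[:, 1:]         # data matrix and its one-step-ahead shift

r = 2                                     # truncation rank (assumed)
U, s, Vh = np.linalg.svd(X, full_matrices=False)
U, s, Vh = U[:, :r], s[:r], Vh[:r, :]

# Projected operator U* X' V Sigma^{-1}, an r x r matrix
Atilde = U.conj().T @ Xp @ Vh.conj().T @ np.diag(1.0 / s)
evals, W = np.linalg.eig(Atilde)          # DMD eigenvalues
Phi = Xp @ Vh.conj().T @ np.diag(1.0 / s) @ W   # DMD modes as columns

print(np.abs(evals), np.angle(evals))
```

For this data the eigenvalue magnitudes give the decay rate per step and the angles give the oscillation frequency per step.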


Dynamic programming — A powerful optimization approach used extensively
for optimal nonlinear control and reinforcement learning. Dynamic programming
reformulates large multi-step optimization problems into a recursive
optimization of smaller sub-problems, relying on Bellman’s principle of
optimality.


Dynamical system — A mathematical model for the dynamic evolution of a 
system. Typically, a dynamical system is formulated in terms of ordinary dif- 
ferential equations (ODEs) on a state space. The resulting equations may be 
linear or nonlinear and may also include the effect of actuation inputs and rep- 
resent outputs as sensor measurements of the state. 


Eigensystem realization algorithm (ERA) — A system identification technique 
that produces balanced input-output models of a system from impulse-response 
data. ERA has been shown to produce equivalent models to balanced proper 
orthogonal decomposition (BPOD) and dynamic mode decomposition (DMD) 
under some circumstances. 


Emission — The measurement functions for a hidden Markov model. 


Environment — The external system or world in which a reinforcement learn- 
ing agent takes actions to interact with. Often, the environment is a Markov 
decision process. 


Fast Fourier transform (FFT) — A numerical algorithm to compute the discrete 
Fourier transform (DFT) in O(n log(n)) operations. The FFT has revolutionized 
modern computations, signal processing, compression, and data transmission. 
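
As a small check, the sketch below compares NumPy's FFT against a direct $O(n^2)$ DFT built from an explicit matrix. The signal length and random signal are arbitrary choices.

```python
import numpy as np

def dft_direct(x):
    """O(n^2) discrete Fourier transform via an explicit DFT matrix."""
    n = len(x)
    k = np.arange(n)
    F = np.exp(-2j * np.pi * np.outer(k, k) / n)   # same sign convention as np.fft.fft
    return F @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

X_fft = np.fft.fft(x)          # O(n log n)
X_dft = dft_direct(x)          # O(n^2)
print(np.max(np.abs(X_fft - X_dft)))
```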


Feedback control — Closed-loop control where sensors measure the downstream
effect of actuators, so that information is fed back to the actuators. Feedback is
essential for robust control where model uncertainty and instability may be
counteracted with fast sensor feedback.


Feedforward control — Control where sensors measure the upstream disturbances
to a system, so that information is fed forward to actuators to cancel
disturbances proactively.


Fourier transform — A change of basis used to represent a function in terms of 
an infinite series of sines and cosines. 


Galerkin projection — A process by which governing partial differential equa- 
tions (PDEs) are reduced into ordinary differential equations (ODEs) in terms 
of the dynamics of the coefficients of a set of orthogonal basis modes that are 
used to approximate the solution. 


Gated recurrent unit (GRU) — GRUs are a class of machine learning models 
for sequential data. GRUs are a subset of RNNs and LSTMs, but with a forget 
gate and with fewer parameters than an LSTM, since a GRU does not have an 
output gate. 


Generative adversarial network (GAN) — GANs are a class of machine learn- 
ing models that learn to generate new data with the same statistics as the train- 
ing set. GANs include a generative network that learns to map from a latent 
space to a data distribution of interest, while a second discriminator network 
classifies data candidates produced by the generator from the true data distri- 
bution. The generative network’s training objective is to increase the error rate 
of the discriminative network. Thus the generator can fool the discriminator 
network by producing novel candidates that the discriminator thinks are not 
synthesized data. 


Gramian — The controllability (respectively, observability) Gramian determines 
the degree to which a state is controllable (respectively, observable) via actua- 
tion (respectively, estimation). The Gramian establishes an inner product on the 
state space. 


Hidden Markov model (HMM) — A Markov model where there is a hidden
state that is only observed through a set of measurements known as emissions. 


Hilbert space — A generalized vector space with an inner product. When re- 
ferred to in this text, a Hilbert space typically refers to an infinite-dimensional 
function space. These spaces are also complete metric spaces, providing a suf- 
ficient mathematical framework to enable calculus on functions. 


Hindsight experience replay — The process of learning from past experiences 
in off-policy reinforcement learning algorithms, such as Q-learning. 
Imitation learning — The process of learning from other more experienced
agents in off-policy reinforcement learning algorithms, such as Q-learning. 


Incoherent measurements — Measurements that have a small inner product 
with the basis vectors of a sparsifying transform. For instance, single-pixel mea- 
surements (i.e., spatial delta functions) are incoherent with respect to the spa- 
tial Fourier transform basis, since these single-pixel measurements excite all 
frequencies and do not preferentially align with any single frequency. 


Kalman filter — An estimator that reconstructs the full state of a dynamical sys- 
tem from measurements of a time series of the sensor outputs and actuation 
inputs. A Kalman filter is itself a dynamical system that is constructed for ob- 
servable systems to stably converge to the true state of the system. The Kalman 
filter is optimal for linear systems with Gaussian process and measurement 
noise of a known magnitude. 


Koopman eigenfunction — An eigenfunction of the Koopman operator. These 
eigenfunctions correspond to measurements on the state space of a dynamical 
system that form intrinsic coordinates. In other words, these intrinsic measure- 
ments will evolve linearly in time despite the underlying system being nonlin- 
ear. 


Koopman operator — An infinite-dimensional linear operator that propagates 
measurement functions from an infinite-dimensional Hilbert space through a 
dynamical system. 


Laplace transform — A generalization of the Fourier transform for a larger class 
of functions that are not Lebesgue-integrable, such as exponential functions. 
The Laplace transform may be thought of as a weighted, one-sided Fourier 
transform for badly behaved functions. 


Least-squares regression — A regression technique where a best-fit line or vec- 
tor is found by minimizing the sum of squares of the error between the model 
and the data. 


Linear–quadratic regulator (LQR) — An optimal proportional feedback controller
for full-state feedback, which balances the objectives of regulating the
state while not expending too much control energy. The proportional gain matrix
is determined by solving an algebraic Riccati equation.


Linear system — A system where superposition of any two inputs results in the 
superposition of the two corresponding outputs. In other words, doubling the
input doubles the output. Linear time-invariant dynamical systems are characterized
by linear operators, which are represented as matrices.


Long short-term memory (LSTM) — LSTMs are a class of machine learning 
models for sequential data. LSTMs are a subset of RNNs, with specific filter- 
ing functions for improving sequential (temporal) modeling. A common LSTM 
unit is composed of a cell, an input gate, an output gate, and a forget gate. The 
cell remembers values over arbitrary time intervals and the three gates regulate 
the flow of information into and out of the cell. 


Low rank — A property of a matrix where the number of linearly independent 
rows and columns is small compared with the size of the matrix. Generally, 
low-rank approximations are sought for large data matrices. 


Machine learning — A set of statistical tools and algorithms that are capable of
extracting the dominant patterns in data. The data mining can be supervised or 
unsupervised, with the goal of clustering, classification and prediction. 


Markov decision process (MDP) — A common environment in reinforcement 
learning, in which the probability of the system being in the next state is deter- 
mined entirely by the current state and action. 


Markov model — A probabilistic dynamical system where the state vector con- 
tains the probability that the system will be in a given state; thus, this state 
vector must always sum to unity. The dynamics are given by the Markov tran- 
sition matrix, which is constructed so that each row sums to unity. 
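
A minimal sketch, assuming the row-vector convention that goes with a transition matrix whose rows sum to unity, so that the probability vector updates as $p_{k+1} = p_k P$. The 3-state matrix is hypothetical.

```python
import numpy as np

# Hypothetical 3-state transition matrix; row i holds P(next state = j | current state = i)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
assert np.allclose(P.sum(axis=1), 1.0)   # each row sums to unity

p = np.array([1.0, 0.0, 0.0])            # start in state 0 with certainty
for _ in range(200):
    p = p @ P                            # row-vector convention: p_{k+1} = p_k P

print(p, p.sum())
```

The probability vector continues to sum to unity at every step, and for this matrix it settles to a stationary distribution that satisfies $p = pP$.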


Markov parameters — The output measurements of a dynamical system in re- 
sponse to an impulsive input. 


Max pooling — A data downsampling strategy whereby an input representa- 
tion (image, hidden-layer output matrix, etc.) is reduced in dimensionality, thus 
allowing for assumptions to be made about features contained in the downsam- 
pled sub-regions. 


Model predictive control (MPC) — A form of optimal control that optimizes a 
control policy over a finite-time horizon, based on a model. The models used 
for MPC are typically linear and may be determined empirically via system 
identification. 


Moore’s law — The observation that transistor density, and hence processor 
speed, increases exponentially in time. Moore’s law is commonly used to predict
future computational power and the associated increase in the scale of
problem that will be computationally feasible.


Multi-scale — The property of having many scales in space and/or time. Many 
systems, such as turbulence, exhibit spatial and temporal scales that vary across 
many orders of magnitude. 


Neural operator — Neural operators are a class of machine learning models
for sequential data typically generated by a deterministic dynamical system.
Like DeepONets, neural operators are tailored to learn operators that map
between infinite-dimensional function spaces.


Observability — A system is observable if it is possible to estimate any system 
state with a time history of the available sensors. Degrees of observability are 
determined by the observability Gramian. 


Observable function — A function that measures some property of the state of 
a system. Observable functions are typically elements of a Hilbert space. 


Off-policy reinforcement learning — Reinforcement learning algorithms, such 
as Q-learning, where the agent is able to take sub-optimal actions while learn- 
ing, enabling imitation learning and hindsight replay. 


On-policy reinforcement learning — Reinforcement learning algorithms where 
the agent must take the best action according to its current policy as it learns. 


Optimization — Generally a set of algorithms that find the “best available” val- 
ues of some objective function given a defined domain (or input), including a 
variety of different types of objective functions and different types of domains. 
Mathematically, optimization aims to maximize or minimize a real function by 
systematically choosing input values from within an allowed set and comput- 
ing the value of the function. The generalization of optimization theory and 
techniques to other formulations constitutes a large area of applied mathemat- 
ics. 


Over-determined system — A system Ax = b where there are more equations 
than unknowns. Usually, there is no exact solution x to an over-determined 
system, unless the vector b is in the column space of A. 


Pareto front — The allocation of resources from which it is impossible to
reallocate so as to make any one individual or preference criterion better off without
making at least one individual or preference criterion worse off.
Perron–Frobenius operator — The adjoint of the Koopman operator, the
Perron–Frobenius operator is an infinite-dimensional operator that advances
probability density functions (PDFs) through a dynamical system.


Physics-informed neural network (PINN) — PINNs are a class of machine 
learning models for sequential data typically generated by a deterministic dy- 
namical system. PINNs enforce governing equations, initial conditions, and 
boundary conditions in the training loss function. Thus the neural network is 
trained to solve a supervised learning task while respecting any given laws of 
physics described by general nonlinear partial differential equations. 


Policy iteration — A form of dynamic programming in which the policy func- 
tion and value function are iteratively updated, while the other is held fixed. 


Policy function — A set of rules about what action an agent should take given 
the current state of the environment. 


Power spectrum — The squared magnitude of each coefficient of a Fourier trans- 
form of a signal. The power corresponds to the amount of each frequency re- 
quired to reconstruct a given signal. 


Principal component — A spatially correlated mode in a given data set, often 
computed using the singular value decomposition (SVD) of the data after the 
mean has been subtracted. 


Principal component analysis (PCA) — A decomposition of a data matrix into a 
hierarchy of principal component vectors that are ordered from most correlated 
to least correlated with the data. PCA is computed by taking the singular value 
decomposition (SVD) of the data after subtracting the mean. In this case, each
squared singular value is proportional to the variance of the corresponding
principal component (singular vector) in the data.
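
A minimal sketch of PCA via the SVD follows. It assumes a layout with observations as rows and features as columns, and the synthetic two-dimensional data are placeholders. The variance along each principal direction is the squared singular value divided by the number of samples minus one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: rows are observations, columns are features (assumed layout)
n_samples = 500
latent = rng.standard_normal((n_samples, 2)) * np.array([3.0, 0.5])
angle = np.pi / 6
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = latent @ R.T + np.array([2.0, -1.0])

B = X - X.mean(axis=0)                       # subtract the mean
U, s, Vh = np.linalg.svd(B, full_matrices=False)

components = Vh                              # rows are principal directions
variances = s**2 / (n_samples - 1)           # variance along each principal direction
scores = B @ Vh.T                            # data in principal-component coordinates

print(variances)
```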


Proper orthogonal decomposition (POD) — The decomposition of data from a 
dynamical system into a hierarchical set of orthogonal modes, often using the 
singular value decomposition (SVD). When the data consists of velocity mea- 
surements of a system, such as an incompressible fluid, then the POD orders 
modes in terms of the amount of energy these modes contain in the given data. 


Pseudo-inverse — The pseudo-inverse generalizes the matrix inverse for non-square
matrices, and is often used to compute the least-squares solution to a
system of equations. The singular value decomposition (SVD) is a common
method to compute the pseudo-inverse: given the SVD $X = U\Sigma V^*$, the
pseudo-inverse is $X^\dagger = V\Sigma^{-1}U^*$.
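
The SVD route can be sketched as follows, assuming a tall matrix with full column rank so that every singular value is nonzero and can be inverted directly. The random matrix and right-hand side are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))            # tall matrix: more equations than unknowns
b = rng.standard_normal(8)

# Economy SVD: X = U diag(s) V*
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Pseudo-inverse V Sigma^{-1} U* (valid here because all singular values are nonzero)
X_pinv = Vh.conj().T @ np.diag(1.0 / s) @ U.conj().T

x_ls = X_pinv @ b                          # least-squares solution of X x = b
print(np.linalg.norm(X @ x_ls - b))
```

For rank-deficient matrices, only the nonzero singular values are inverted, which is what `np.linalg.pinv` does with a tolerance.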


Q-learning — A leading model-free reinforcement learning algorithm based on
the quality function Q(s, a). 


Quality function — The joint quality of being in a particular state s and taking 
a given action a. The quality function Q(s,a) extends the value function, by- 
passing the need to know the next optimal state s’ and providing the basis for 
Q-learning. 


Recurrent neural network (RNN) — RNNs are a class of machine learning mod- 
els for sequential data, typically a temporal sequence, where connections be- 
tween nodes form a directed or undirected graph along the sequence. Although 
similar to feedforward neural networks, RNNs use their internal state (mem- 
ory) to process variable-length sequences of inputs. 


Reduced-order model (ROM) — A model of a high-dimensional system in terms
of a low-dimensional state. Typically, a reduced-order model balances accuracy 
with computational cost of the model. 


Regression — A statistical model that represents an outcome variable in terms of 
indicator variables. Least-squares regression is a linear regression that finds the 
line of best fit to data; when generalized to higher dimensions and multi-linear 
regression, this generalizes to principal components regression. Nonlinear re- 
gression, dynamic regression, and functional or semantic regression are used 
in system identification, model reduction, and machine learning. 


Reinforcement learning (RL) — A major branch of machine learning that is con- 
cerned with how to learn control laws and policies to interact with a complex 
environment from experience. 


Restricted isometry property (RIP) — The property that a matrix acts like a 
unitary matrix, or an isometry map, on sparse vectors. In other words, the dis- 
tance between any two sparse vectors is preserved if these vectors are mapped 
through a matrix that satisfies the restricted isometry property. 


Reward — A positive reinforcement signal in reinforcement learning (RL), to be 
maximized by an agent's policy $\pi(s, a)$.


Reward shaping — The process of constructing a customized proxy reward signal
that may be used to improve the learning rate for systems with sparse
rewards.


Robust control — A field of control that penalizes worst-case scenario control 
outcomes, thus promoting controllers that are robust to uncertainties, distur- 
bances, and unmodeled dynamics. 


Robust statistics — Methods for producing good statistical estimates for data
drawn from a wide range of probability distributions, especially for distribu- 
tions that are not normal and where outliers compromise predictive capabili- 
ties. 


SARSA — SARSA (state–action–reward–state–action) learning is a form of
on-policy temporal difference learning closely related to Q-learning.


Singular value decomposition (SVD) — Given a matrix $X \in \mathbb{C}^{n\times m}$, the SVD is
given by $X = U\Sigma V^*$, where $U \in \mathbb{C}^{n\times n}$, $\Sigma \in \mathbb{C}^{n\times m}$, and $V \in \mathbb{C}^{m\times m}$. The
matrices $U$ and $V$ are unitary, so that $UU^* = U^*U = I$ and $VV^* = V^*V = I$. The
matrix $\Sigma$ has entries along the diagonal corresponding to the singular values
that are ordered from largest to smallest. This produces a hierarchical matrix 
decomposition that splits a matrix into a sum of rank-one matrices given by 
the outer product of a column vector (left singular vector) with a row vector 
(conjugate transpose of a right singular vector). These rank-one matrices are 
ordered by the singular value, so that the first r rank-one matrices form the best 
rank-r matrix approximation of the original matrix in a least-squares sense. 
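
The rank-one expansion and the best rank-r property can be sketched as follows. The random test matrix and the choice `r = 3` are arbitrary. The check at the end uses the Eckart-Young result that the Frobenius-norm error equals the root sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 12))

U, s, Vh = np.linalg.svd(X, full_matrices=False)

def rank_r_approx(r):
    """Sum of the first r rank-one matrices s_k * u_k v_k^*."""
    return sum(s[k] * np.outer(U[:, k], Vh[k, :]) for k in range(r))

r = 3
X_r = rank_r_approx(r)

# Frobenius error of the truncation vs. energy in the discarded singular values
err = np.linalg.norm(X - X_r, "fro")
print(err, np.sqrt(np.sum(s[r:] ** 2)))
```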


Snapshot — A single high-dimensional measurement of a system at a particular 
time. A number of snapshots collected at a sequence of times may be arranged 
as column vectors in a data matrix. 


Sparse identification of nonlinear dynamics (SINDy) — A nonlinear system 
identification framework used to simultaneously identify the nonlinear struc- 
ture and parameters of a dynamical system from data. Various sparse optimiza- 
tion techniques may be used to determine SINDy models. 


Sparsity — A vector is sparse if most of its entries are zero or nearly zero. Spar- 
sity refers to the observation that most data are sparse when represented as 
vectors in an appropriate transformed basis, such as Fourier or proper orthog- 
onal decomposition (POD) bases. 


Spectrogram — A short-time Fourier transform computed on a moving window,
which results in a time-frequency plot of which frequencies are active at
a given time. The spectrogram is useful for characterizing non-periodic signals,
where the frequency content evolves over time, as in music.


State space — The set of all possible system states. Often the state space is a vector
space, such as $\mathbb{R}^n$, although it may also be a smooth manifold $\mathcal{M}$.


Stochastic gradient descent — Also known as incremental gradient descent, it 
allows one to approximate the gradient with a single data point instead of all 
available data. At each step of the gradient descent, a randomly chosen data 
point is used to compute the gradient direction. 
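
A minimal sketch on a least-squares problem, where each step uses the gradient from a single randomly chosen data point. The synthetic noiseless data, the step size, and the iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noiseless regression problem y = A w_true (assumed for illustration)
A = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = A @ w_true

w = np.zeros(3)
step = 0.01                                  # learning rate (assumed)
for _ in range(20000):
    i = rng.integers(len(y))                 # pick one data point at random
    grad_i = (A[i] @ w - y[i]) * A[i]        # gradient of 0.5 * (a_i . w - y_i)^2
    w -= step * grad_i

print(w)
```

Each update costs one row of `A` rather than all 200, which is the point of the method when the data set is large.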


System identification — The process by which a model is constructed for a sys- 
tem from measurement data, possibly after perturbing the system. 


Temporal difference error — The difference between the estimated future re- 
ward (i.e., the target) and the actual future reward, which is used to update the 
value or quality function in TD learning. 


Temporal difference (TD) learning — A sample-based reinforcement learning 
strategy, in which the current value or quality function is updated based on the 
rewards obtained in the subsequent events. TD learning is designed to mimic 
the learning process in animals. 


Temporal difference target — The estimated future reward in TD learning. 


Time delay coordinates — An augmented set of coordinates constructed by con- 
sidering a measurement at the current time along with a number of times in 
the past at fixed intervals from the current time. Time delay coordinates are 
often useful in reconstructing attractor dynamics for systems that do not have 
enough measurements, as in the Takens embedding theorem. 
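
A minimal sketch of building delay coordinates from one scalar measurement. The helper `delay_embed`, the number of delays, and the lag are hypothetical choices for illustration.

```python
import numpy as np

def delay_embed(x, n_delays, lag=1):
    """Stack x(t), x(t - lag), ..., x(t - (n_delays - 1) * lag) as rows."""
    x = np.asarray(x)
    start = (n_delays - 1) * lag
    rows = [x[start - d * lag : len(x) - d * lag] for d in range(n_delays)]
    return np.stack(rows)

t = np.linspace(0.0, 20.0, 2001)
x = np.sin(t)                        # a single scalar measurement
H = delay_embed(x, n_delays=10, lag=5)

print(H.shape)
```

Each column of `H` is one augmented state. For a pure sinusoid the delayed copies span only two independent directions, so `H` has numerical rank 2.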


Total least-squares — A least-squares regression algorithm that minimizes the 
error on both the inputs and the outputs. Geometrically, this corresponds to 
finding the line that minimizes the sum of squares of the total distance to all 
points, rather than the sum of squares of the vertical distance to all points. 


Uncertainty quantification (UQ) — The principled characterization and man- 
agement of uncertainty in engineering systems. Uncertainty quantification of- 
ten involves the application of powerful tools from probability and statistics to 
dynamical systems. 


Under-determined system — A system Ax = b where there are fewer equations 
than unknowns. Generally, the system has infinitely many solutions x unless b
is not in the column space of A.


Unitary matrix — A matrix whose complex conjugate transpose is also its in- 
verse. All eigenvalues of a unitary matrix are on the complex unit circle, and 
the action of a unitary matrix may be thought of as a change of coordinates that 
preserves the Euclidean distance between any two vectors. 


Value function — A function quantifying the desirability of being in a given 
state s, as calculated by the discounted sum of future rewards, given an op- 
timal policy starting from this state. The value function is often written in a 
recursive form, based on Bellman’s equation. 


Value iteration — A form of dynamic programming that iteratively updates the 
value function, after which an optimal policy may be extracted. 
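
A minimal sketch on a hypothetical three-state, two-action Markov decision process with known transition probabilities. The arrays `P` and `R` and the discount factor `gamma` are assumptions for illustration.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = transition probability, R[s, a] = reward
P = np.array([[[1.0, 0.0, 0.0],     # action 0: stay in place
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]],
              [[0.0, 1.0, 0.0],     # action 1: move right (state 2 is absorbing)
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])          # reward only when acting from state 2
gamma = 0.9                         # discount factor (assumed)

V = np.zeros(3)
for _ in range(500):
    # Q(s, a) = R(s, a) + gamma * sum over s' of P(s' | s, a) V(s')
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)           # Bellman optimality update
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

policy = Q.argmax(axis=1)           # greedy policy extracted from the converged values
print(V, policy)
```

The value function is iterated to convergence first, and the policy is read off only at the end by taking the action with the highest quality in each state.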


Wavelet — A generalized function, or family of functions, used to generalize the 
Fourier transform to approximate more complex and multi-scale signals.